Paper Information

Citation: Hinton, G. E. (1984). Distributed Representations. Technical Report CMU-CS-84-157. Carnegie-Mellon University.

Publication: CMU Computer Science Department Technical Report, October 1984

What kind of paper is this?

This is primarily a Theory ($\Psi_{\text{Theory}}$) paper, with strong secondary elements of Method ($\Psi_{\text{Method}}$) and Position ($\Psi_{\text{Position}}$).

It is a theoretical work because its core contribution is the formal mathematical derivation of the storage capacity and accuracy of distributed schemes (coarse coding) compared to local schemes. It serves as a position paper by arguing against the “grandmother cell” (local representation) intuition prevalent in AI at the time, advocating instead for the “constructive” view of memory.

What is the motivation?

The motivation is to overcome the inefficiency of local representations, where one hardware unit corresponds to exactly one entity.

  • Inefficiency: In local representations, accuracy in a $k$-dimensional feature space grows only as $\sqrt[k]{n}$, so achieving high accuracy requires an exponential number of units.
  • Brittleness: Local representations do not naturally support generalization; learning a fact about one concept (e.g., “chimps like onions”) does not automatically transfer to similar concepts (e.g., “gorillas”) without extra machinery.
  • Hardware Mismatch: Massive parallelism is wasted if units are only rarely active (a unit active 50% of the time conveys up to 1 bit of information; a sparsely active local unit conveys almost none).
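The information-per-unit point in the last bullet is just the Shannon entropy of a binary unit. A minimal check (the probability values are illustrative, not from the paper):

```python
import math

def bits_per_unit(p):
    """Shannon entropy (in bits) of a binary unit active with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# A unit active half the time carries the maximum 1 bit of information;
# a local unit firing for only 1 of 1000 entities carries almost none.
print(bits_per_unit(0.5))    # 1.0
print(bits_per_unit(0.001))  # ~0.011
```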

What is the novelty here?

The paper introduces formal mechanisms that explain why distributed representations are superior:

  1. Coarse Coding Efficiency: Hinton proves that using broad, overlapping receptive fields (“coarse coding”) yields higher accuracy for a fixed number of units than non-overlapping local fields. He derives that accuracy scales linearly with the number of units ($a \propto n$) rather than as a fractional root of $n$.
  2. Automatic Generalization: It demonstrates that generalization is an emergent property of vector overlap. Modifying weights for one pattern automatically affects similar patterns (conspiracy effect).
  3. Solution to the Binding Problem: It proposes a “role/identity” mechanism to solve the binding problem (how to keep “John’s father” distinct from “John”) by combining identity vectors with role vectors.
[Figure: three pools of distributed units (AGENT, RELATIONSHIP, PATIENT) connected via role/identity bindings. Separate pools represent different roles, and combinations of active units encode specific bindings between entities and their roles.]
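The coarse-coding claim (point 1) can be illustrated with a small Monte Carlo sketch: broad, overlapping circular receptive fields in 2-D carve space into many more distinguishable activation patterns per unit than small, nearly disjoint ones. This is my own toy construction, not the paper's derivation; the unit counts and radii are arbitrary.

```python
import random

def distinct_codes(n_units, radius, samples=20000, seed=1):
    """Count the distinct activation patterns that n circular receptive
    fields of a given radius produce over the unit square (k = 2).
    More distinct codes means finer localization of a point."""
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(n_units)]
    codes = set()
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        codes.add(tuple((x - cx) ** 2 + (y - cy) ** 2 < radius ** 2
                        for cx, cy in centers))
    return len(codes)

# Broad overlapping fields (r=0.30) yield far more distinguishable regions
# than small near-disjoint ones (r=0.05), consistent with a ∝ n·r^(k-1).
print(distinct_codes(30, 0.05), distinct_codes(30, 0.30))
```

Note that the advantage holds only up to a point: once the fields are so large that every one covers the whole square, all inputs produce the same code and discrimination collapses, matching the paper's caveat that coarse coding trades accuracy against resolution.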

What experiments were performed?

The paper performs analytical derivations and two specific computer simulations:

  1. Arbitrary Mapping Simulation: A 3-layer network trained to map 20 grapheme strings (e.g., words) to 20 unrelated semantic vectors.
  2. Damage & Recovery Analysis:
    • Lesioning: Removing “word-set” units to observe error patterns.
    • Noise Injection: Adding noise to weights to simulate “Deep Dyslexia” (semantic errors like reading “PEACH” as “APRICOT”).
    • Retraining: Measuring the speed of relearning after damage (“spontaneous recovery”).

What outcomes/conclusions?

  1. Accuracy Scaling: For a $k$-dimensional feature space, the accuracy $a$ of a distributed representation scales as $a \propto n \cdot r^{k-1}$ (where $r$ is the receptive field radius), vastly outperforming local schemes.
  2. Reliability: Distributed systems exhibit graceful degradation. Removing units causes slight noise across many items rather than the catastrophic loss of single items.
  3. Spontaneous Recovery: When a damaged network is retrained on only a subset of items, it “spontaneously” recovers the unrehearsed items as well, because the items share weights. This is a qualitative signature of distributed representations.
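The graceful-degradation conclusion (point 2) is easy to demonstrate with a Hopfield-style outer-product associative memory. This is a stand-in chosen for brevity, not the paper's Boltzmann network; all sizes are illustrative.

```python
import random

def recall_accuracy(n=100, n_patterns=5, frac_removed=0.0, seed=0):
    """Store random +/-1 patterns with Hebbian outer-product weights,
    delete a fraction of the units, and measure one-step recall on the
    surviving units."""
    rng = random.Random(seed)
    pats = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n_patterns)]
    # Hebbian outer-product weight matrix, zero diagonal
    W = [[0 if i == j else sum(p[i] * p[j] for p in pats)
          for j in range(n)] for i in range(n)]
    alive = set(range(n)) - set(rng.sample(range(n), int(frac_removed * n)))
    correct = total = 0
    for p in pats:
        for i in alive:
            s = sum(W[i][j] * p[j] for j in alive)
            correct += ((1 if s >= 0 else -1) == p[i])
            total += 1
    return correct / total

print(recall_accuracy(frac_removed=0.0))  # near-perfect recall
print(recall_accuracy(frac_removed=0.3))  # still near-perfect, just noisier
```

Removing 30% of the units adds a little noise to every stored item instead of deleting any single item outright, which is the qualitative behavior the paper contrasts with local storage.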

Reproducibility Details

The following details are extracted from Section 5 (“Implementing an Arbitrary Mapping”) to facilitate reproduction of the “Arbitrary Mapping” and “Deep Dyslexia” simulations.

Data

The simulation uses synthetic data representing words and meanings.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training | Synthetic grapheme/sememe pairs | 20 pairs | 20 distinct grapheme strings mapped to random semantic vectors. |
  • Input (Graphemes): 30 total units.
    • Structure: Divided into 3 groups of 10 units each.
    • Encoding: Each “word” (3 letters) activates exactly 1 unit in each group (sparse binary).
  • Output (Sememes): 30 total units.
    • Structure: Binary units.
    • Encoding: Meanings are random vectors where each unit is active with probability $p=0.2$.
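A data generator following the specification above (the function name and seed are mine; the paper does not publish its exact vectors):

```python
import random

def make_dataset(n_words=20, groups=3, units_per_group=10,
                 n_sememes=30, p=0.2, seed=0):
    """Synthetic grapheme/sememe pairs per the paper's spec: each word
    activates exactly one grapheme unit per group (3 groups of 10), and
    each meaning is a random binary vector with activation probability p."""
    rng = random.Random(seed)
    words, meanings, seen = [], [], set()
    while len(words) < n_words:
        letters = tuple(rng.randrange(units_per_group) for _ in range(groups))
        if letters in seen:  # keep the grapheme strings distinct
            continue
        seen.add(letters)
        grapheme = [0] * (groups * units_per_group)
        for g, letter in enumerate(letters):
            grapheme[g * units_per_group + letter] = 1
        words.append(grapheme)
        meanings.append([int(rng.random() < p) for _ in range(n_sememes)])
    return words, meanings

words, meanings = make_dataset()
print(len(words), sum(words[0]), len(meanings[0]))  # 20 3 30
```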

Algorithms

  • Learning Rule: The paper cites “Hinton, Sejnowski & Ackley (1984)” (Boltzmann Machines) for the specific learning algorithm used to set weights.
  • False Positive Analysis: The probability $f$ that a semantic feature is incorrectly activated is derived as:

$$f = (1 - (1-p)^{(w-1)})^u$$

Where:

  • $p$: Probability of a sememe being in a word meaning ($0.2$).
  • $w$: Number of words in a “word-set” (cluster).
  • $u$: Number of active “word-set” units per word.
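A direct transcription of the formula; the default values of $w$ and $u$ here are illustrative, not taken from the paper:

```python
def false_positive_prob(p=0.2, w=4, u=5):
    """f = (1 - (1-p)**(w-1))**u: the probability that every one of the
    u active word-set units happens to support an unwanted sememe."""
    return (1 - (1 - p) ** (w - 1)) ** u

# f falls rapidly as u grows, and rises as word-sets get larger (w).
for w in (2, 4, 8):
    print(w, round(false_positive_prob(w=w), 4))
```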

Models

The simulation uses a specific three-layer architecture.

| Layer | Type | Count | Connectivity |
| --- | --- | --- | --- |
| Input | Grapheme units | 30 | Connected to all word-set units (no direct link to Output). |
| Hidden | “Word-set” units | 20 | Fully connected to Input and Output. |
| Output | Sememe units | 30 | Connected to all word-set units. Includes lateral inhibition (implied, for “clean-up”). |
  • Weights: Binary/integer weights in the theoretical analysis; learned weights (with stochastic binary units) in the Boltzmann-machine simulation.
  • Thresholds: Sememe units have variable thresholds dynamically adjusted to be slightly less than the number of active word-set units.
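A deterministic sketch of the forward pass with the dynamic sememe thresholds described above. This simplifies the paper's stochastic Boltzmann network to binary logic; the data-structure choices (index lists per unit) are mine.

```python
def forward(grapheme, receptive_fields, sememe_sets, n_sememes):
    """Three-layer forward pass. A word-set unit fires when all of its
    grapheme connections are active; each sememe unit's threshold is set
    just below the number of currently active word-set units, so only
    sememes supported by every active word-set unit turn on."""
    hidden = [all(grapheme[j] for j in field) for field in receptive_fields]
    n_active = sum(hidden)
    if n_active == 0:
        return [0] * n_sememes
    support = [0] * n_sememes
    for on, sememes in zip(hidden, sememe_sets):
        if on:
            for s in sememes:
                support[s] += 1
    theta = n_active - 0.5  # dynamic threshold, slightly below n_active
    return [int(c > theta) for c in support]

# Two word-set units, three sememes: only the sememe supported by both
# active units (index 1) clears the dynamic threshold.
print(forward([1, 1], [[0], [1]], [[0, 1], [1, 2]], 3))  # [0, 1, 0]
```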

Evaluation

The simulation evaluated the robustness of the mapping.

| Metric | Value | Baseline | Notes |
| --- | --- | --- | --- |
| Accuracy (clean) | 99.9% | N/A | Correct pattern produced 99.9% of the time after learning. |
| Lesion error rate | 1.4% | N/A | 140 errors in 10,000 tests after removing 1 word-set unit. |
| Semantic errors | ~60% of errors | N/A | 83 of the 140 errors were “Deep Dyslexia” errors (producing a valid but wrong semantic pattern). |
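A quick arithmetic check that the reported counts and percentages are mutually consistent:

```python
errors, semantic, tests = 140, 83, 10_000

print(f"lesion error rate: {errors / tests:.1%}")            # 1.4%
print(f"semantic share of errors: {semantic / errors:.0%}")  # 59%
```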

Hardware

  • Compute: Minimal. The original simulation ran on 1980s hardware (likely VAX-11 or similar).
  • Replication: Reproducible on any modern CPU in milliseconds.