What kind of paper is this?
This is primarily a Theory paper, with strong secondary elements of Method and Position.
It is a theoretical work because its core contribution is the formal mathematical derivation of the storage capacity and accuracy of distributed schemes (coarse coding) compared to local schemes. It serves as a position paper by challenging the “grandmother cell” (local representation) intuition prevalent in AI at the time and advocating for the “constructive” view of memory.
What is the motivation?
The motivation is to overcome the inefficiency of local representations, where one hardware unit corresponds to exactly one entity, and to challenge traditional metaphors of memory.
- Inefficiency: In local representations, accuracy grows only as $\sqrt[k]{n}$ with the number of units $n$ in a $k$-dimensional feature space, so high accuracy requires a number of units that grows as the $k$-th power of the desired accuracy.
- Brittleness: Local representations lack natural support for generalization; learning a fact about one concept (e.g., “chimps like onions”) requires extra machinery to transfer to similar concepts (e.g., “gorillas”).
- Hardware Mismatch: Massive parallelism is wasted if units are rarely active; a unit active half the time can convey up to 1 bit of information, while a sparsely active local unit conveys almost none.
- The “Filing Cabinet” Metaphor: The paper challenges the standard view of memory as a storage system of literal copies. It motivates a shift toward understanding memory as a reconstructive inference process.
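The two scaling laws contrasted here ($a \propto \sqrt[k]{n}$ for local schemes versus the $a \propto n\,r^{k-1}$ law Hinton derives for coarse coding) can be made concrete with a toy calculation. This is a sketch only: proportionality constants are dropped, so only ratios between the two schemes are meaningful.

```python
# Units needed to reach a target accuracy `a` in a k-dimensional feature
# space, using the scaling laws quoted in the text (constants dropped,
# so absolute numbers are meaningless; only ratios matter).

def local_units(a, k):
    # Local scheme: accuracy grows as n**(1/k), so n must grow as a**k.
    return a ** k

def coarse_units(a, k, r):
    # Coarse coding: accuracy grows as n * r**(k-1), so n grows as a / r**(k-1).
    return a / r ** (k - 1)

# Reaching 10x accuracy in 3 dimensions:
print(local_units(10, 3))      # 1000 -> a thousandfold increase in local units
print(coarse_units(10, 3, 5))  # far cheaper when receptive fields are broad (r = 5)
```

The point of the comparison is that broadening the receptive fields (larger $r$) buys accuracy without adding units, which local non-overlapping fields cannot do.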
What is the novelty here?
The paper introduces formal mechanisms that explain why distributed representations are superior:
- Coarse Coding Efficiency: Hinton proves that using broad, overlapping receptive fields (“coarse coding”) yields higher accuracy for a fixed number of units than non-overlapping local fields, deriving that accuracy scales linearly with the number of units ($a \propto n$).
- Automatic Generalization: It demonstrates that generalization is an emergent property of vector overlap. Modifying weights for one pattern automatically affects similar patterns (conspiracy effect).
- Memory as Reconstruction: It posits that memory is a reconstructive process where items are created afresh from fragments using plausible inference rules (connection strengths). This blurs the line between veridical recall and confabulation.
- Gradual Concept Formation: Distributed representations allow new concepts to emerge gradually through weight modifications that progressively differentiate existing concepts. This avoids the discrete decisions and spare hardware units required by local representations.
- Solution to the Binding Problem: It proposes that true part/whole hierarchies are formed by fusing the identity of a part with its role to produce a single, new subpattern. The representation of the whole is then the sum of these combined identity/role representations.
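The coarse-coding mechanism behind the efficiency claim can be illustrated with a toy 2-D sketch: each unit has a broad circular receptive field, and a point is pinned down by the intersection of the active fields. The grid layout, field radius, and brute-force decoder below are our assumptions for illustration, not the paper's construction.

```python
# Coarse-coding sketch: overlapping circular receptive fields on a 6x6 grid.
RADIUS = 1.5
CENTERS = [(x, y) for x in range(6) for y in range(6)]

def encode(point):
    """Activate every unit whose receptive field covers the point."""
    px, py = point
    return {c for c in CENTERS
            if (c[0] - px) ** 2 + (c[1] - py) ** 2 <= RADIUS ** 2}

def decode(active):
    """Localize the point to the region whose encoding matches the active
    set exactly, approximated by averaging over a fine candidate grid."""
    candidates = [(x / 10, y / 10) for x in range(51) for y in range(51)]
    feasible = [p for p in candidates if encode(p) == active]
    xs = [p[0] for p in feasible]
    ys = [p[1] for p in feasible]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# The decoded position lands close to the original point, even though
# each individual unit is very imprecise (radius 1.5 on a unit grid).
print(decode(encode((2.3, 3.1))))
```

Because the point is recovered from the *intersection* of many coarse fields, precision far exceeds that of any single unit, which is exactly the trade Hinton quantifies.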
What experiments were performed?
The paper performs analytical derivations and two specific computer simulations:
- Arbitrary Mapping Simulation: A 3-layer network trained to map 20 grapheme strings (e.g., words) to 20 unrelated semantic vectors.
- Damage & Recovery Analysis:
  - Lesioning: Removing “word-set” units to observe error patterns.
  - Noise Injection: Adding noise to weights to simulate “Deep Dyslexia” (semantic errors like reading “PEACH” as “APRICOT”).
  - Retraining: Measuring the speed of relearning after damage (“spontaneous recovery”).
What outcomes/conclusions?
- Accuracy Scaling: For a $k$-dimensional feature space, the accuracy $a$ of a distributed representation scales as $a \propto n \cdot r^{k-1}$ (where $r$ is the receptive field radius), vastly outperforming local schemes.
- Reliability: Distributed systems exhibit graceful degradation. Removing units causes slight noise across many items.
- Spontaneous Recovery: When retraining a damaged network on a subset of items, the network “spontaneously” recovers unrehearsed items due to weight sharing, which is a qualitative signature of distributed representations.
- Limitations of Coarse Coding: The paper identifies that coarse coding requires relatively sparse features. Crowding too many feature-points together causes receptive fields to contain too many features, preventing the activity pattern from discriminating between combinations.
Reproducibility Details
The following details are extracted from Section 5 (“Implementing an Arbitrary Mapping”) to facilitate reproduction of the “Arbitrary Mapping” and “Deep Dyslexia” simulations.
Data
The simulation uses synthetic data representing words and meanings.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Synthetic Grapheme/Sememe Pairs | 20 pairs | 20 different grapheme strings mapped to random semantic vectors. |
- Input (Graphemes): 30 total units.
  - Structure: Divided into 3 groups of 10 units each.
  - Encoding: Each “word” (3 letters) activates exactly 1 unit in each group (sparse binary).
- Output (Sememes): 30 total units.
  - Structure: Binary units.
  - Encoding: Meanings are random vectors where each unit is active with probability $p = 0.2$.
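Under this specification, an equivalent training set can be regenerated in a few lines. This is a sketch: the seed and helper names are ours, and the paper's actual random vectors are of course unrecoverable.

```python
import random

random.seed(0)  # arbitrary seed; the paper's vectors cannot be recovered

N_WORDS, GROUPS, UNITS_PER_GROUP, N_SEMEMES, P_ACTIVE = 20, 3, 10, 30, 0.2

def make_word():
    # One active unit in each group of 10 -> 30-unit sparse binary input.
    vec = [0] * (GROUPS * UNITS_PER_GROUP)
    for g in range(GROUPS):
        vec[g * UNITS_PER_GROUP + random.randrange(UNITS_PER_GROUP)] = 1
    return tuple(vec)

def make_meaning():
    # Each sememe unit is active independently with probability 0.2.
    return tuple(int(random.random() < P_ACTIVE) for _ in range(N_SEMEMES))

pairs = [(make_word(), make_meaning()) for _ in range(N_WORDS)]
print(len(pairs), "grapheme/sememe pairs generated")
```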
Algorithms
- Learning Rule: The paper cites “Hinton, Sejnowski & Ackley (1984)” (Boltzmann Machines) for the specific learning algorithm used to set weights.
- False Positive Analysis: The probability $f$ that a semantic feature is incorrectly activated is derived as
  $$f = (1 - (1-p)^{(w-1)})^u$$
  where:
  - $p$: Probability of a sememe being in a word meaning ($0.2$).
  - $w$: Number of words in a “word-set” (cluster).
  - $u$: Number of active “word-set” units per word.
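The formula can be checked numerically (the function name is ours):

```python
# Direct evaluation of the false-positive probability from the text.
def false_positive_prob(p, w, u):
    """Each of the u active word-set units supports a spurious sememe only
    if at least one of the other w-1 words in its set contains that sememe
    (probability 1 - (1-p)**(w-1)); all u must agree for a false positive."""
    return (1 - (1 - p) ** (w - 1)) ** u

# With the paper's p = 0.2, small word-sets keep false positives rare:
print(false_positive_prob(0.2, 2, 3))  # ~0.008
print(false_positive_prob(0.2, 5, 3))  # grows quickly with word-set size w
```

The exponential dependence on $u$ is why requiring agreement among several active word-set units suppresses spurious sememes.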
Models
The simulation uses a specific three-layer architecture.
| Layer | Type | Count | Connectivity |
|---|---|---|---|
| Input | Grapheme Units | 30 | Connected to all hidden “word-set” units (no direct link to Output). |
| Hidden | “Word-Set” Units | 20 | Fully connected to Input and Output. |
| Output | Sememe Units | 30 | Connected to all hidden “word-set” units. Includes lateral inhibition (implied for “clean up”). |
- Weights: Integer-valued in the theoretical analysis; the Boltzmann-machine simulation uses real-valued weights with stochastic binary units.
- Thresholds: Sememe units have variable thresholds dynamically adjusted to be slightly less than the number of active word-set units.
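A minimal forward pass through this architecture can be sketched as follows, with deterministic threshold units standing in for the paper's stochastic Boltzmann-machine dynamics. The weights here are random placeholders, not trained values, so outputs are structurally correct but not meaningful.

```python
import random

# Three-layer mapping network: 30 grapheme -> 20 word-set -> 30 sememe units.
N_GRAPHEMES, N_WORDSETS, N_SEMEMES = 30, 20, 30

random.seed(1)  # placeholder weights; a real run would learn these
w_in = [[random.choice([0, 1]) for _ in range(N_WORDSETS)]
        for _ in range(N_GRAPHEMES)]
w_out = [[random.choice([0, 1]) for _ in range(N_SEMEMES)]
         for _ in range(N_WORDSETS)]

def forward(graphemes, theta_hidden=2):
    # Word-set units fire when enough of their grapheme inputs are active.
    hidden = [int(sum(g * w_in[i][j] for i, g in enumerate(graphemes))
                  >= theta_hidden)
              for j in range(N_WORDSETS)]
    n_active = sum(hidden)
    # Sememe thresholds are set slightly below the number of active
    # word-set units, as described in the text.
    theta_out = max(n_active - 1, 1)
    sememes = [int(sum(h * w_out[j][k] for j, h in enumerate(hidden))
                   >= theta_out)
               for k in range(N_SEMEMES)]
    return hidden, sememes

# One active grapheme unit per group of 10, per the encoding above.
hidden, sememes = forward([1 if i % 10 == 0 else 0 for i in range(30)])
print(sum(hidden), "word-set units active")
```

The dynamic output threshold is the interesting design choice: a sememe fires only when nearly all active word-set units vote for it, which is what keeps the false-positive rate low.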
Evaluation
The simulation evaluated the robustness of the mapping.
| Metric | Value | Baseline | Notes |
|---|---|---|---|
| Accuracy (Clean) | 99.9% | N/A | Correct pattern produced 99.9% of the time after learning. |
| Lesion Error Rate | 1.4% | N/A | 140 errors in 10,000 tests after removing 1 word-set unit. |
| Semantic Errors | ~60% of errors | N/A | 83 of the 140 errors were “Deep Dyslexia” errors (producing a valid but wrong semantic pattern). |
Hardware
- Compute: Minimal. The original simulation ran on 1980s hardware (likely VAX-11 or similar).
- Replication: Reproducible on any modern CPU in milliseconds.
Paper Information
Citation: Hinton, G. E. (1984). Distributed Representations. Technical Report CMU-CS-84-157. Carnegie-Mellon University.
Publication: CMU Computer Science Department Technical Report, October 1984
@techreport{hinton1984distributed,
  title={Distributed representations},
  author={Hinton, Geoffrey E.},
  number={CMU-CS-84-157},
  institution={Carnegie-Mellon University},
  year={1984}
}