Learning How to Represent the World

Not long ago I wrote a piece called “Thoughts on Space”. Its point was that space can be seen not as the backdrop against which things are placed, but as a form for efficiently holding countless relationships. LLM embeddings, the hippocampal cognitive map, and the emergent spacetime posited by some modern physics seem unrelated, yet all of them use the form we call space — and I tried to explain why, through the principle of compressing relationships.

Earlier still, in “Art and Virus”, I defined art not as the transmission of information but as a process of revising an internal model. From that view, good art shakes the prior distribution the viewer holds, generates prediction error, and, in the process of resolving that error, makes the viewer update the very rules by which they interpret the world.

I had thought the two pieces dealt with different subjects. But rereading them and mulling it over, I realized that both were about understanding, each approaching it from a different side.

We commonly equate understanding with a state of possessing a great deal of knowledge. But when you set information theory, machine learning, and cognitive science side by side, understanding is closer to representing the world more economically than to storing a lot of information. This difference becomes clear in how relationships are handled.

The world can be seen as a set of relationships. When there are N objects, the pairwise relationships among them number N(N−1)/2, which grows roughly in proportion to N², and if you store each one in a table, the storage grows at the same rate. But if you represent each object with a d-dimensional coordinate, the count of numbers to store shrinks to N×d, and relationships such as the distance or direction between two objects are recovered on the fly through operations on the coordinates. In the end, when d is fixed and much smaller than N, a coordinate system becomes a device that compresses O(N²) relationships into an O(N) representation.
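The counting argument can be checked with a few lines of Python. This is only a sketch: the sizes `n = 10_000` and `d = 64` are arbitrary values chosen for illustration, and Euclidean distance stands in for whatever relationship the coordinates are meant to encode.

```python
import math


def pairwise_table_size(n: int) -> int:
    """Number of unordered pairs among n objects: n(n-1)/2."""
    return n * (n - 1) // 2


def coordinate_table_size(n: int, d: int) -> int:
    """Count of numbers stored when each object gets a d-dimensional coordinate."""
    return n * d


def distance(a, b) -> float:
    """Recover one relationship (Euclidean distance) on the fly from coordinates."""
    return math.dist(a, b)


# Hypothetical sizes, chosen only to make the gap visible.
n, d = 10_000, 64
pairs = pairwise_table_size(n)         # entries in an explicit relationship table
coords = coordinate_table_size(n, d)   # entries in a coordinate table
ratio = pairs / coords                 # how many times larger the explicit table is

# A relationship that was never stored is computed when needed.
points = {"a": (0.0, 0.0), "b": (3.0, 4.0)}
d_ab = distance(points["a"], points["b"])

print(pairs, coords, round(ratio, 1), d_ab)
```

The explicit table holds 49,995,000 entries against 640,000 coordinates, and the gap widens as N grows while d stays fixed.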

This kind of compression requires regularity in the relationships. A completely random set of relationships cannot be greatly reduced by any coordinate system. The manifold hypothesis in machine learning holds that this condition is, by and large, met in real data.¹ A 64×64 grayscale image is a single point in a 4096-dimensional space, but actual face images are distributed only near a far lower-dimensional surface (a manifold) governed by a few degrees of freedom such as lighting, angle, and expression; that embeddings succeed is itself empirical evidence that the data has this low-dimensional structure.
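A minimal sketch of what "low-dimensional structure inside a high-dimensional space" looks like numerically, assuming NumPy is available. The data here is synthetic: three hidden factors (stand-ins for something like lighting, angle, and expression) are mixed linearly into 4096 dimensions with a little noise. A linear subspace is only the simplest special case of a manifold, since real image manifolds are curved, so this illustrates the idea rather than testing the hypothesis itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 200 samples in a 4096-dimensional space, driven by 3 hidden factors.
n_samples, ambient_dim, latent_dim = 200, 4096, 3

latent = rng.normal(size=(n_samples, latent_dim))    # the few true degrees of freedom
mixing = rng.normal(size=(latent_dim, ambient_dim))  # how the factors show up in "pixels"
noise = 0.01 * rng.normal(size=(n_samples, ambient_dim))
data = latent @ mixing + noise

# Singular values of the centered data measure how much variance each direction carries.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
variance = singular_values ** 2
explained_by_top = variance[:latent_dim].sum() / variance.sum()

print(f"variance explained by {latent_dim} directions: {explained_by_top:.6f}")
```

Almost all of the variance sits in three directions even though each sample has 4096 coordinates, which is the situation in which a short coordinate description is possible at all.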

LLMs demonstrate this principle vividly. Rather than memorizing words individually, an LLM, in the course of learning to predict the next word, arranges each word as a high-dimensional vector — placing words of similar meaning near one another and aligning similar relationships in similar directions. The reason vector arithmetic like “king − man + woman ≈ queen” holds in word2vec-style models is that relationships such as gender or status have been encoded as fixed directions in the space.²
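The arithmetic can be made concrete with a toy example. The vectors below are written by hand, not learned, and the two axes ("royalty" and "gender") are assumptions made purely for illustration; in a trained model such directions are only approximate and live among hundreds of other dimensions.

```python
import math

# Hand-built toy vectors, NOT learned embeddings.
# Axis 0 is an assumed "royalty" direction, axis 1 an assumed "gender" direction.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
print(result)
```

Because the gender relationship is the same displacement for both pairs, subtracting "man" and adding "woman" moves "king" onto "queen". Excluding the input words from the candidates follows common practice in analogy evaluation.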

There is evidence that the brain uses a similar strategy. Place cells in the hippocampus fire when an animal is at a particular location, and grid cells in the entorhinal cortex fire in a regular pattern that tiles space with a hexagonal grid — a discovery that led to the 2014 Nobel Prize in Physiology or Medicine.³

What is striking is that this system is not used only for physical space. In one fMRI study, when people learned an abstract concept space defined by two axes, the leg length and neck length of a bird stimulus, the same hexagonally symmetric signal that appears during physical movement was observed in the entorhinal cortex.⁴ This supports the idea that the brain tends to organize concepts as a map with coordinates rather than storing them as a list.

But before concluding that compression simply is understanding, one distinction must be made. The ability to describe what one has observed concisely is not the same as the ability to predict what one has not yet seen. Two representations that compress the same data equally well can behave entirely differently when faced with a new case. One may be merely a summary that happens to fit, while the other may have caught the very structure that generated the data. What we call understanding does not stop at summarizing what was seen; it comes into being only when that summary carries over to what was not seen. The heart of understanding lies not in the size of the compression but in transferable structure.

This distinction takes concrete form in recent work on cognitive maps. James C. R. Whittington, Timothy E. J. Behrens, and colleagues explained the hippocampal-entorhinal system with a model that learns structure and content separately (the Tolman-Eichenbaum Machine, or TEM).⁵ Here structure is the form of the relationships in which objects are placed, and content is the sensory information that fills each slot. If you keep the two apart, the relational structure learned in one environment can be reused directly in a new environment simply by swapping in new content. The model holds that this is why codes like grid cells recur even when the environment changes. A good representation is not merely a short representation but a reused one. Compression may be a necessary condition for understanding, but it is not a sufficient one; only when the compressed thing can be used again do we call it understanding.
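The separation of structure from content can be illustrated with a deliberately tiny sketch. This is not the TEM, which is a learned neural model; it is a hand-written toy in which the structure (how moves change position on a ring of four slots) is given rather than learned, and the room contents are invented for the example.

```python
# Structure: how a move changes position on a ring of four slots.
# In this toy it is hard-coded; in the TEM it would be learned.
MOVES = {"left": -1, "right": 1}


def step(position: int, move: str, size: int = 4) -> int:
    """Apply one move on a ring of `size` slots."""
    return (position + MOVES[move]) % size


def predict(content, start: int, moves) -> str:
    """Predict what will be sensed after a sequence of moves, given slot contents."""
    position = start
    for move in moves:
        position = step(position, move, len(content))
    return content[position]


# Content: what fills each slot. Two hypothetical environments, same structure.
room_a = ["door", "lamp", "desk", "bed"]
room_b = ["oak", "pond", "rock", "fern"]

path = ["right", "right", "left", "right"]
seen_in_a = predict(room_a, 0, path)
seen_in_b = predict(room_b, 0, path)
print(seen_in_a, seen_in_b)
```

Nothing about `step` changes between the two rooms. Only the content list is swapped, and predictions in the new environment come for free, which is the sense of "reused" the paragraph relies on.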

Information theory shows, from another angle and quantitatively, that information and understanding are not the same. Shannon defined the information content of an event x as −log p(x) with respect to its probability p(x) — that is, as surprisal — so that the less probable an event, the more information it carries.⁶ By this definition it follows that the more completely a prediction is missed, the greater the information content — and this is exactly where a paradox arises.
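A short sketch of the definition, using base-2 logarithms so that the unit is bits (the base is an assumption here, since the text leaves it open). The second function averages surprisal over a distribution, which gives Shannon entropy.

```python
import math


def surprisal_bits(p: float) -> float:
    """Information content -log2 p(x) of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


def entropy_bits(distribution) -> float:
    """Expected surprisal of a distribution, i.e. its Shannon entropy in bits."""
    return sum(p * surprisal_bits(p) for p in distribution if p > 0)


likely = surprisal_bits(0.5)        # a fair coin flip
unlikely = surprisal_bits(1 / 1024)  # a one-in-1024 event
fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])

print(likely, unlikely, fair_coin, round(biased_coin, 3))
```

The rarer event carries ten bits against one, and the predictable biased coin has lower average information than the fair one.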

A completely random source has the greatest Shannon entropy, since each symbol is maximally unpredictable, yet we do not say we understand its output. Kolmogorov complexity describes the same phenomenon from the standpoint of computation.⁷ The complexity K(x) of a string is defined as the length of the shortest program that outputs it; a regular string like “010101…” is generated by a short program and so has small K, whereas a fully random sequence has no description substantially shorter than itself, so K is roughly the length of the string, up to an additive constant. Thus the least compressible object is the most complex, but that does not make it the best-understood object. On the contrary, understanding holds only where compression is possible.
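K(x) itself is uncomputable, but a general-purpose compressor gives a crude upper bound that shows the contrast. The sketch below uses `zlib` from the Python standard library as that stand-in. One caveat worth stating loudly: the "noise" here comes from a seeded pseudo-random generator, so its true Kolmogorov complexity is actually small (this very program produces it). `zlib` simply cannot find that regularity, which is a reminder that compressed size bounds K from above and says nothing about how far below the bound K really lies.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length after zlib compression: a rough upper bound on description length."""
    return len(zlib.compress(data, 9))


n = 10_000

# A highly regular string: "0101..." repeated.
regular = b"01" * (n // 2)

# Pseudo-random bytes from a fixed seed (an assumption made for reproducibility).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(n))

regular_len = compressed_length(regular)
noise_len = compressed_length(noise)

print(f"regular: {len(regular)} -> {regular_len} bytes")
print(f"noise:   {len(noise)} -> {noise_len} bytes")
```

The regular string shrinks to a tiny fraction of its size, while the noise does not shrink at all and may even grow slightly from format overhead.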

There is something to clarify here. The compression Kolmogorov speaks of is the length of the program that produces a string — that is, algorithmic compression. By contrast, the compression I spoke of earlier with coordinate systems and embeddings is geometric compression, placing objects in a low-dimensional space. The two use the same word but are not the same thing. The digits of pi compress into a short program yet yield no useful coordinate arrangement, while conversely points on a manifold form a geometric representation even though they are hard to describe with a short program.

This piece has leaned throughout on the latter, geometric metaphor, but the tool that props up the argument that information and understanding differ is the former, algorithmic compression. Having a lot of information and having understood are different statements, and understanding is closer to converting a lot of information into a shorter, reusable representation.

So how can a representation be changed? Predictive coding theory sees the brain as a hierarchical generative model in which higher regions predict the input coming from lower regions, and only the difference between actual input and prediction — the prediction error — is passed upward.⁸ In this framework, perception is not the passive reception of sensation but active inference that reconciles prediction with input, and learning is the process of revising the model so as to reduce that prediction error.
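The core loop, in which only the mismatch between prediction and input drives revision, can be sketched with a single number. This is a plain delta-rule update on one level, not the hierarchical model of Rao and Ballard; the learning rate and the constant input are assumptions chosen to keep the example readable.

```python
def learn(inputs, learning_rate: float = 0.2, prediction: float = 0.0):
    """Revise a scalar prediction from its errors; return the errors and the final value."""
    errors = []
    for x in inputs:
        error = x - prediction               # the only signal that is passed on
        prediction += learning_rate * error  # revise the model to shrink future error
        errors.append(abs(error))
    return errors, prediction


# Hypothetical input: the world keeps presenting the value 1.0.
errors, final_prediction = learn([1.0] * 30)

print(f"first error: {errors[0]:.3f}, last error: {errors[-1]:.5f}")
print(f"final prediction: {final_prediction:.4f}")
```

As the prediction converges on the input, the error signal fades, and with it the pressure to change the model any further.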

But if the prediction error is too small, there is no reason to fix the model, and no learning happens. Conversely, if the error is too large, the input has no point of contact with the existing model, and one cannot pin down what to fix or how. Learning happens best in between, in the band where the existing representation falls short of an explanation, but a slight revision of the representation explains several phenomena together.

What is interesting is that this intuition recurs in almost the same shape across three different fields. The psychologist Daniel E. Berlyne, in the experimental aesthetics of the 1970s, saw that the relationship between variables such as novelty, complexity, and uncertainty and the pleasure they produce traces an inverted U.⁹ A stimulus that is too simple and obvious is boring, one too complex and unfamiliar is unpleasant, and pleasure peaks in between. This shape, known as the Wundt curve, was among the first quantitative statements that the point where humans feel attraction coincides with the point where learning is possible.

Decades later this intuition reappears as a far more concrete learning principle in artificial intelligence and robotics. Jürgen Schmidhuber proposed a learning principle that takes as its intrinsic reward not the degree to which an observation is compressed but the amount by which the compression rate improves.¹⁰ Interest and beauty arise not from an object already well compressed (boredom) or one not compressible at all (noise), but at the moment one comes to compress it better than a moment before. Pierre-Yves Oudeyer and colleagues implemented this idea experimentally in developmental robots.¹¹ When they designed the reward so the agent would seek out situations where its predictive ability improves fastest, the agent — though no one instructed it — was drawn to a learnable frontier that was neither too obvious nor too random, and once it had mastered one area it moved on to the next frontier of its own accord.
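The idea of rewarding improvement rather than accuracy can be sketched as follows. This is neither Schmidhuber's formalism nor the algorithm of Oudeyer and colleagues; it is a toy in which three activities come with fixed, made-up error curves, and the agent simply prefers whichever one showed the largest recent drop in prediction error.

```python
def error_curves(steps: int):
    """Hypothetical prediction-error histories for three kinds of activity."""
    return {
        "trivial": [0.0 for _ in range(steps)],         # already predicted perfectly
        "learnable": [0.9 ** t for t in range(steps)],  # error falls with practice
        "noise": [1.0 for _ in range(steps)],           # never becomes predictable
    }


def learning_progress(errors, window: int = 5) -> float:
    """Mean error in the older window minus mean error in the most recent window."""
    older = errors[-2 * window:-window]
    recent = errors[-window:]
    return sum(older) / window - sum(recent) / window


curves = error_curves(20)
progress = {name: learning_progress(errs) for name, errs in curves.items()}

# Intrinsic reward is the progress itself, so the agent picks the steepest improvement.
chosen = max(progress, key=progress.get)
print(progress, chosen)
```

Both the trivial and the noisy activity yield zero progress, one because there is nothing left to learn and the other because nothing can be learned, so the choice falls on the learnable one. As that curve flattens, its progress also tends toward zero, which is what would push such an agent on to the next frontier.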

Arousal potential, compression progress, and learning progress — these three vocabularies ultimately point to a single curve: the inverted U of learnability. That the same shape recurs from the psychology labs of the 1970s to today’s robots suggests it may not be the fashion of one era but the form of the process of learning itself. The moment a rule newly comes into view — the moment one comes to compress the world a little better than a moment before — is the moment understanding is updated and the moment we feel fun and beauty.

Seen through this framework, art can be redescribed more generally than my earlier account of it as an interpolation between information and noise. Good art is neither fully explained by the viewer’s existing representation nor pure noise. A work fully explained by the existing representation produces no prediction error and so triggers no new understanding, while a work close to pure noise offers no structure to reference and so no new representation can be learned. A good work aims at the point where the viewer’s existing representation fails, and through that failure demands a new representation.

Art changes the very way facts are arranged. As once-unrelated objects draw near and once-obvious things recede, the coordinate system on which the viewer places the world is redrawn. This is also why good works endure. That a work keeps operating after viewing is not because one repeatedly recalls its content, but because it makes one interpret subsequent experience on the newly redrawn coordinate system. It is here that the two pieces meet. If “Thoughts on Space” dealt with the process of compression that erects a coordinate system, “Art and Virus” dealt with the process of recompression that forces that coordinate system to be redrawn, and with either one alone, learning does not come about.

This piece has leaned throughout on the metaphor of space and coordinate systems, but the metaphor is not the conclusion. Two different claims are in fact overlaid here. One is the claim that understanding is a matter of having a better representation; the other is the claim that this representation often takes the form of space.

The first claim is neutral about the form of the representation. A good representation may be a metric space, but it may also be a relational graph, a hierarchy, or the form of a program. The second claim is not a necessity but only an empirical tendency frequently observed in both humans and machines. This is also what makes the Tolman-Eichenbaum Machine interesting, because the representation it speaks of is less a pure metric coordinate than a structure that has separated the form of relationships from their content. Likewise, Kolmogorov’s compression is defined not by geometry but by the length of a program. So to say we convert the world into space is better read not narrowly as literal geometry but as organizing relationships into a transferable form.

Science looks for a representation that explains a wider range rather than collecting more facts; machine learning embeds data into a more useful representation space; the brain organizes experience into a cognitive map. Art exposes the limits of the existing representation space and proposes a new representation, and some modern physics examines the possibility that even spacetime is a representation that emerged from a more fundamental information structure. These are not the same theory; they do not use the same mathematics or solve the same problem. And yet one intuition recurs: that understanding is not a matter of having more information but of having a better representation.

If so, there is room to turn over the common notion that we live within space. We are closer to beings who live by ceaselessly converting the world into representations. Experience becomes representation, representation gives rise to prediction, prediction fails, and that failure makes a new representation in turn. This cycle is a form common to human learning, the history of science, the training of artificial intelligence, and the effect that art leaves behind. In the end, understanding is not the ability to know more facts but the ability to represent the world better — and from this view the function of good art can be redefined as well. It is not to change the world, but to change the way we represent it.

¹ Fefferman, C., Mitter, S., & Narayanan, H., “Testing the manifold hypothesis”, Journal of the American Mathematical Society 29(4), 983–1049 (2016).
² For the result that analogy holds via vector arithmetic, see Mikolov, T., Yih, W., & Zweig, G., “Linguistic Regularities in Continuous Space Word Representations”, NAACL-HLT (2013). For the model architecture, Mikolov, T., Chen, K., Corrado, G., & Dean, J., “Efficient Estimation of Word Representations in Vector Space”, arXiv:1301.3781 (2013).
³ For place cells, O’Keefe, J., & Dostrovsky, J., “The hippocampus as a spatial map”, Brain Research 34, 171–175 (1971). For grid cells, Hafting, T., Fyhn, M., Molden, S., Moser, M.-B., & Moser, E. I., “Microstructure of a spatial map in the entorhinal cortex”, Nature 436, 801–806 (2005). The 2014 Nobel Prize in Physiology or Medicine was awarded to John O’Keefe and May-Britt and Edvard Moser.
⁴ Constantinescu, A. O., O’Reilly, J. X., & Behrens, T. E. J., “Organizing conceptual knowledge in humans with a gridlike code”, Science 352(6292), 1464–1468 (2016).
⁵ Whittington, J. C. R., Muller, T. H., Mark, S., Chen, G., Barry, C., Burgess, N., & Behrens, T. E. J., “The Tolman-Eichenbaum Machine: Unifying Space and Relational Memory through Generalization in the Hippocampal Formation”, Cell 183(5), 1249–1263 (2020).
⁶ Shannon, C. E., “A Mathematical Theory of Communication”, Bell System Technical Journal 27, 379–423, 623–656 (1948).
⁷ Kolmogorov, A. N., “Three Approaches to the Quantitative Definition of Information”, Problems of Information Transmission 1(1), 1–7 (1965). Solomonoff (1964) and Chaitin (1966) independently proposed similar notions.
⁸ Rao, R. P. N., & Ballard, D. H., “Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects”, Nature Neuroscience 2(1), 79–87 (1999). Its later extension into hierarchical generative models and the free-energy principle is due to Friston, K. (2005, 2010).
⁹ Berlyne, D. E., Aesthetics and Psychobiology, Appleton-Century-Crofts (1971). Its earlier groundwork is Berlyne, D. E., Conflict, Arousal, and Curiosity, McGraw-Hill (1960).
¹⁰ Schmidhuber, J., “Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010)”, IEEE Transactions on Autonomous Mental Development 2(3), 230–247 (2010).
¹¹ Oudeyer, P.-Y., Kaplan, F., & Hafner, V. V., “Intrinsic Motivation Systems for Autonomous Mental Development”, IEEE Transactions on Evolutionary Computation 11(2), 265–286 (2007).
