UP | HOME

Zooming through the conceptual layers

This is, arguably, the most important skill in programming, so let's exercise it!

What, conceptually, do most language models do? They calculate (and then sample from) probability distributions over the sequences of words they have been trained on.

So, this is the Bayesian rule, Markov models, etc. This is exactly where we could stop – conceptually, there is nothing but a probabilistic model, with a bunch of well-known “tricks” – well-known precisely because they are nothing but probabilities.
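The conceptual core can be sketched in a few lines. This is a minimal first-order Markov (bigram) model: count observed word pairs, then normalize the counts into conditional probabilities. The corpus here is a made-up toy, not real training data.

```python
from collections import Counter, defaultdict

# Count observed word pairs in a tiny, hand-made "training set".
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word_probs(word):
    """Conditional distribution P(next | word), straight from frequencies."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

# "the" is followed by "cat" twice, "mat" once, "fish" once.
print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

Sampling from such distributions, one word at a time, is generation. Everything else is scale and representation.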

What, operationally, do most language models do? They produce a binary artifact, which is a representation of an abstract multi-dimensional vector space in which so-called “word embeddings” are “topologically sorted” (by an abstract distance, or “proximity”).

This is a less-wrong operational definition. It is indeed a “sorting” problem, but instead of a linear sequence (in a 1D, linear “space”) we “sort” (or cluster) in an abstract nD space.
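The “sorting by proximity” can be made concrete with a distance measure. Below is a cosine-similarity sketch over hand-picked 3D toy vectors – real embeddings have hundreds of learned components, these values are an illustration only.

```python
import math

# Toy "embeddings": hand-chosen 3D vectors, not learned values.
emb = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: the usual abstract 'proximity' measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Under this measure, "cat" sorts nearer to "dog" than to "car".
assert cosine(emb["cat"], emb["dog"]) > cosine(emb["cat"], emb["car"])
```

The “space” is nothing but these number lists; “nearness” is nothing but this formula.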

The actual “distances” are encoded (embedded into the structure) as an additional layer of weights, which are “updated” (destructively) on each iteration of the training process.
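“Destructive updating” is exactly what a gradient-descent step does: each iteration overwrites the weight in place, and the previous value is gone. A toy 1D version, fitting a single weight w so that w * x approximates y:

```python
# One training pair; the weight that fits it exactly is w = 3.0.
x, y = 2.0, 6.0
w = 0.0       # initial weight
lr = 0.1      # learning rate

for _ in range(50):
    pred = w * x
    grad = 2 * (pred - y) * x   # d/dw of the squared error (w*x - y)^2
    w -= lr * grad              # destructive: the old w is overwritten

print(round(w, 3))  # converges toward 3.0
```

Scaled up to billions of weights, this in-place overwriting is the whole training loop.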

So this is not the usual graph of “gradients” (the actual weights) within an “algebraic expression”; it is an additional component “superimposed” onto it – an “index”.

By updating this “positional encoding” we “sort” the “space” of “word embeddings”. The whole thing therefore is a form of “indexing”. Too many quotes, huh?

We index on abstract “distances” based on the frequencies of “observed” linear sequences of words and sentences in the training set.

Notice that there is literally no place for any form of “intelligence” to “emerge” so far.

Math

First of all, “hypercubes” and “hyperplanes” are just mathematical memes. They are our abstract interpretations of certain formulas and transformations.

The formulas and corresponding transformations are sound, while our interpretations of them – the names we give them and the concepts we use – are bullshit.

Addition (putting together), scaling (multiplication by a real number), and weighted sums in general are “universal notions”; so are the graphs of these, and partial differentiation over these graphs.

A river, which at its source is a multitude of individual streams with different flow rates, is what partial differentiation of a tree-shaped graph is. Universal.
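The river picture can be traced by hand on a tiny tree-shaped expression: gradients flow backward from the single output, splitting across the branches via the chain rule. Here f(a, b, c) = (a + b) * c, differentiated manually:

```python
# Forward pass through the tree-shaped expression f = (a + b) * c.
a, b, c = 2.0, 3.0, 4.0
s = a + b          # intermediate node: s = 5.0
f = s * c          # output node:       f = 20.0

# Backward pass: the "river" flowing from the mouth back to the streams.
df_df = 1.0            # seed at the output
df_ds = df_df * c      # ∂f/∂s = c = 4.0
df_dc = df_df * s      # ∂f/∂c = s = 5.0
df_da = df_ds * 1.0    # ∂s/∂a = 1, so ∂f/∂a = 4.0
df_db = df_ds * 1.0    # likewise ∂f/∂b = 4.0

print(df_da, df_db, df_dc)  # 4.0 4.0 5.0
```

Back-propagation is this, repeated over a much bigger tree. Nothing more.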

There is no such thing as dimensions outside our heads. The fact that motion (and forces) in physical space can be modeled using 3-component vectors does not justify extending this method to arbitrary abstractions.

Another fact – that in the resulting model the motions along each axis are independent of each other – is a fact about this Universe, and has no further implications.

And no, time IS NOT another, “orthogonal” dimension. It is an unrelated abstract concept, and by definition is not orthogonal in the first place. It is completely unrelated.

Last but not least, a cube (or a sphere) exists in, and only in, 3 “dimensions”. Something with more numbers in a vector is a different, unrelated concept.

Using component-wise addition and scaling produces a mathematically correct abstract (numerical) result, but the meaning (semantics) of this result is arbitrary and depends on a particular set of assumptions.
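Component-wise arithmetic is always numerically well-defined; whether the result *means* anything depends entirely on what we decided the components stand for. A minimal illustration:

```python
# Component-wise operations on arbitrary number lists.
u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]

added  = [a + b for a, b in zip(u, v)]   # [5.0, 7.0, 9.0]
scaled = [2.0 * a for a in u]            # [2.0, 4.0, 6.0]

# The famous "king - man + woman ≈ queen" embedding arithmetic is this
# same mechanical operation; the "≈ queen" reading is an interpretation
# layered on top, not something the arithmetic itself guarantees.
print(added, scaled)
```

The numbers are always right; the semantics are supplied by us.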

The models use lots of well-understood mathematical techniques, which creates an illusion of enormous sophistication, where in fact there is just complexity.

  • partial differentiation (the actual mathematical implementation of back-propagation)
  • linear algebra (including so-called dimensionality reductions)
  • the Bayesian rule (as a measure of “belief”)
  • etc.
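The Bayesian rule from the list above, as a “measure of belief”, is one line of arithmetic. The numbers below are an arbitrary illustration, not data from anywhere:

```python
# Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E), with made-up numbers.
p_h = 0.01              # prior "belief" in hypothesis H
p_e_given_h = 0.9       # likelihood of evidence E if H holds
p_e_given_not_h = 0.05  # likelihood of E if H does not hold

# Total probability of the evidence.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# The posterior: the "belief" after seeing E.
p_h_given_e = p_e_given_h * p_h / p_e

print(round(p_h_given_e, 3))  # 0.154
```

A belief got updated; no knowledge was produced – which is the point of the next section.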

Combining (or rather mixing-and-matching) these techniques without understanding does not produce any “intelligence”. In fact, it is related to Russell’s paradox – to the unrestricted application of anything to anything else.

The result is abstract, Hegelian-like bullshit, but mathematically sound (each individual operation is sound).

The results

The ability to calculate the probabilities of all sentences seen so far (legal combinations of words) has nothing to do with the meaning (the actual “deep structure”).

This could be conceptually related to “the whole language-using community’s set of beliefs” (according to one interpretation of Bayesian statistics), but this, again, is just a current meme. Beliefs are not knowledge (which has been shown since the beginning of time).

This is especially evident when these training and generative techniques are applied to source code.

There is no notion of principle-guided building of layers of optimal abstractions from the ground up. It is just the probabilities of the next symbol in a row.

Author: <schiptsov@gmail.com>

Email: lngnmn2@yahoo.com

Created: 2023-08-08 Tue 18:37

Emacs 29.1.50 (Org mode 9.7-pre)