
Transformers

First of all, there are a lot of “memes” - “attention”, “transformer”, etc. - whose meanings do not correspond to our intuitive understanding, plus the unfortunate terminology of “queries”, “keys”, and “values”.

“Queries” come from general search, while “keys” and “values” are, probably, reminiscent of hash-tables, which are a special case of a generalized, abstract lookup-table.

There are, however, neither queries nor lookup-tables within a model. Only matrices of real numbers and their products.
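To make that concrete, here is a minimal numpy sketch - the shapes and the names W_q, W_k, W_v are purely illustrative, not taken from any particular library:

    import numpy as np

    rng = np.random.default_rng(0)

    n, d = 4, 8                      # 4 input "positions", model width 8
    X = rng.standard_normal((n, d))  # the input: a matrix of real numbers

    # The "queries", "keys" and "values" are nothing but three more
    # matrix products, with weight matrices learned during training.
    W_q = rng.standard_normal((d, d))
    W_k = rng.standard_normal((d, d))
    W_v = rng.standard_normal((d, d))

    Q = X @ W_q   # "queries"
    K = X @ W_k   # "keys"
    V = X @ W_v   # "values"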

The actual notion is that after a matrix-matrix multiplication, some “positions” are “zeroes” (very close to zero) and can be ignored, while the rest “stand out” and thus have been “selected” (no one there actually selects or attends to anything - it is just multiplication by a sort of “inverse” value).
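Continuing the sketch, the “selection” is just a softmax over a scaled matrix product (the division by the square root of d is the usual scaling); most of the resulting weights end up very close to zero:

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    scores = Q @ K.T / np.sqrt(d)  # raw pairwise scores, just numbers
    A = softmax(scores)            # each row sums to 1; small scores -> ~0

    # Nothing here "attends": the rows of A merely weight the rows of V
    # in the next multiplication, and near-zero weights drop out.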

The general idea is, of course, obvious - to pay attention to the distinctive features, just like eyes on a face or on any animal head. Crows do that - they know the direction in which you (as a potential danger) are looking.

Learning to pay attention only to what changes while “ignoring” what “always” stays the same in an environment is, indeed, a “natural process”, which has evolved in all life forms - every life form pays attention to what moves, suddenly appears, etc.

Learning to recognize common patterns by their distinct features (when they exist) is the same “naturally evolved” ability. Patterns are out there, and so are distinct features.

Technically, what they call “self-attention of a neural network” is just another “layer” (or a matrix) of numbers superimposed on the input. The underlying principle is to “ignore” what is irrelevant by soft-maxing it to (nearly) zero.

This matrix is continuously updated during training, so it eventually “selects” (when multiplied with an input) what seems to be “important” or “relevant”.
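A rough PyTorch sketch of that “continuous updating” - the loss here is made up purely for illustration; the point is only that the matrices behind the “attention” are ordinary parameters nudged by gradient descent:

    import torch

    Xt = torch.randn(4, 8)                      # a toy input
    Wq = torch.randn(8, 8, requires_grad=True)  # the matrices behind Q, K, V
    Wk = torch.randn(8, 8, requires_grad=True)
    Wv = torch.randn(8, 8, requires_grad=True)
    opt = torch.optim.SGD([Wq, Wk, Wv], lr=1e-2)

    for _ in range(100):  # the "continuous updating"
        A_t = torch.softmax((Xt @ Wq) @ (Xt @ Wk).T / 8 ** 0.5, dim=-1)
        loss = ((A_t @ (Xt @ Wv) - Xt) ** 2).mean()  # made-up loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    # After enough steps the weights settle so that the softmax "selects"
    # whatever happens to reduce the loss - statistics, not intent.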

This, of course, does not imply anything “intelligent” and corresponds roughly to accumulated statistics of repeated observations - something “changes”, something else does not. Some patterns are common, others aren’t.

So, all the dramatic memes aside, it is just “another matrix”, which captures what “stands out” and which “selects” when multiplied by the input.

This multiplication “transforms” the input (as a mathematical function does), hence the name.
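Picking the numpy sketch back up, the whole “transformation” ends with one more product:

    out = A @ V  # (4, 8): each output row is a weighted mix of the value
                 # rows - a plain linear transformation of the input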

And this, basically, is it at a high level.

Author: <schiptsov@gmail.com>

Email: lngnmn2@yahoo.com

Created: 2023-08-08 Tue 18:37

Emacs 29.1.50 (Org mode 9.7-pre)