Neural Networks

We know the inputs, the result, and the mathematical expression that relates them.

Every mathematical expression can be turned into an Abstract Syntax Tree.
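As a small illustration, Python's standard `ast` module can show the tree behind an expression. The expression `a * b + c` is only an example chosen here, not one taken from the notes.

```python
import ast

# Parse a single arithmetic expression into Python's own AST.
tree = ast.parse("a * b + c", mode="eval")
root = tree.body  # the outermost operation of the expression

# Because * binds tighter than +, the root is the addition and the
# multiplication sits in its left subtree.
print(type(root).__name__, type(root.op).__name__)            # BinOp Add
print(type(root.left).__name__, type(root.left.op).__name__)  # BinOp Mult
print(root.right.id)                                          # c
```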

The tree structure is crucial: it determines how the individual “arrows” are arranged.

The arrows “arise” from currying binary functions, which correspond to the common infix binary operators, such as $+$ or $*$.
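A minimal sketch of what currying a binary operator looks like, assuming plain Python closures. The names `add` and `mul` are illustrative and not part of the notes.

```python
from typing import Callable

def add(x: float) -> Callable[[float], float]:
    """Curried addition: add(x) returns a one-argument function that adds x."""
    return lambda y: x + y

def mul(x: float) -> Callable[[float], float]:
    """Curried multiplication: mul(x) returns a one-argument function that scales by x."""
    return lambda y: x * y

# The infix expression 2 * 3 + 4 written as a chain of one-argument applications.
result = add(mul(2.0)(3.0))(4.0)
print(result)  # 10.0
```

Each application takes exactly one value and returns either a new function or a final value, which is what gives every step of the evaluation a direction.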

An AST shows how values “flow through” an expression. There is always a direction in every evaluation.

In addition to the values of the parameters, an AST of a partially (or fully) evaluated expression may hold (memoize) some intermediate results, including the derivatives of the function with respect to each of the values.
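One possible way to picture such a decorated tree is sketched below. The `Node` class, its field names, and the `grad` helper are hypothetical and chosen only for this illustration. Each node is immutable and stores its evaluated value together with the local derivative with respect to each argument.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)  # frozen: a node cannot be modified after construction
class Node:
    value: float                          # memoized result of this subtree
    parents: Tuple["Node", ...] = ()      # the argument subtrees
    local_grads: Tuple[float, ...] = ()   # d(value)/d(parent.value), one per parent

def leaf(v: float) -> Node:
    return Node(v)

def add(a: Node, b: Node) -> Node:
    # d(a + b)/da = 1 and d(a + b)/db = 1
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a: Node, b: Node) -> Node:
    # d(a * b)/da = b and d(a * b)/db = a
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def grad(out: Node, wrt: Node) -> float:
    """Chain rule: sum over all paths of the product of local derivatives."""
    if out is wrt:
        return 1.0
    return sum(g * grad(p, wrt) for p, g in zip(out.parents, out.local_grads))

a, b, c = leaf(2.0), leaf(3.0), leaf(4.0)
out = add(mul(a, b), c)   # a * b + c
print(out.value)                                     # 10.0
print(grad(out, a), grad(out, b), grad(out, c))      # 3.0 2.0 1.0
```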

As long as everything in the representation of an expression is referentially transparent and immutable, the math will “work”.

To change the output of a nested expression without changing its inputs (nesting at least one level deep is required), the only way is to scale the parameters of the functions. The functions themselves are, of course, immutable.

This is how a representation of a complex mathematical expression “learns”: the “weights” of the arguments are updated.

The weights can be updated in a systematic way, using the partial derivatives of a binary (curried) function with respect to each of its parameters.
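The sketch below shows one systematic update rule for a single weight. The squared-error objective, the learning rate `lr`, and the concrete numbers are assumptions made for this example and do not come from the notes. Each step returns a new weight rather than mutating the old one.

```python
def loss(w: float, x: float, target: float) -> float:
    """Squared error of the scaled input w * x against the target."""
    return (w * x - target) ** 2

def dloss_dw(w: float, x: float, target: float) -> float:
    """Partial derivative of the loss with respect to the weight w."""
    return 2.0 * (w * x - target) * x

def step(w: float, x: float, target: float, lr: float = 0.1) -> float:
    """Return a new weight moved against the slope; nothing is modified in place."""
    return w - lr * dloss_dw(w, x, target)

x, target = 2.0, 6.0       # fixed input and desired output
history = [0.0]            # initial weight
for _ in range(20):
    history.append(step(history[-1], x, target))

print(history[-1])  # approaches 3.0, since 3.0 * 2.0 == 6.0
```

The input `x` never changes; only the weight does, and the output follows.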

Notice that the derivative is defined as the slope of a function at a particular point.
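To make "slope at a point" concrete, here is a rough numerical approximation using a central difference. The helper name `slope_at` and the step size `h` are assumptions of this sketch.

```python
def slope_at(f, x: float, h: float = 1e-6) -> float:
    """Approximate the slope of f at x from two nearby points on either side."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def square(x: float) -> float:
    return x * x

# The exact derivative of x * x is 2 * x, so the slope at x = 3 is about 6.
approx = slope_at(square, 3.0)
print(round(approx, 4))
```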

A differentiable function is necessarily continuous (it has no gaps), and in addition it has no kinks.

In the mathematical expressions that represent neural nets we do not have arbitrary functions; on the contrary, we have functions of a very particular kind:

addition (putting together)
negation
adding of a negative number
scaling (multiplying)
adding to itself n times (repeated addition)
taking apart into a number of equal parts
raising to a power (exponentiation)
division as raising to a negative power
logarithms and reciprocals

We intentionally de-generalize multiplication and division.
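The list above can be sketched as a handful of primitives from which the familiar operators are rebuilt. The function names are illustrative only; subtraction is expressed as adding a negated number and division as scaling by a negative power.

```python
def add(a: float, b: float) -> float:
    return a + b

def neg(a: float) -> float:
    return -a

def mul(a: float, b: float) -> float:
    return a * b

def power(a: float, n: float) -> float:
    return a ** n

def sub(a: float, b: float) -> float:
    """Subtraction as adding a negative number."""
    return add(a, neg(b))

def div(a: float, b: float) -> float:
    """Division as scaling by b raised to the power -1."""
    return mul(a, power(b, -1))

print(sub(5.0, 3.0))  # 2.0
print(div(8.0, 4.0))  # 2.0
```

With this decomposition, only the primitives need a derivative rule; `sub` and `div` inherit theirs through composition.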

The nodes of an AST are formed out of these functions. The nodes are then “decorated” with additional intermediate results (values).

Another fundamental kind of operation is a structural transformation of the values that does not lose or discard any information.

Such operations must be reversible, which implies no information loss.
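Reshaping is one plausible example of such a transformation, picked here as an assumption since the notes do not name a specific one. The hypothetical helpers below rearrange a flat list into rows and back, and the round trip recovers the original exactly.

```python
from typing import List

def reshape(flat: List[float], rows: int, cols: int) -> List[List[float]]:
    """Arrange a flat list into rows of equal length; no value is dropped."""
    assert len(flat) == rows * cols, "shape must account for every value"
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def flatten(matrix: List[List[float]]) -> List[float]:
    """Inverse of reshape: read the rows back out in order."""
    return [v for row in matrix for v in row]

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
grid = reshape(values, 2, 3)
print(grid)                      # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(flatten(grid) == values)   # True
```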

A function must be differentiable, which means having a non-vertical tangent line at every point. So $y = \mathrm{constant}$ (collapse to a single value) is fine, but $x = \mathrm{constant}$ (a vertical line) is not.