diff --git a/docs/make.jl b/docs/make.jl
index 15ccbc19d..95fa48513 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -31,6 +31,7 @@ makedocs(
         "Debug Mode" => "debug_mode.md",
         "Design" => [
             "Many Differential Types" => "design/many_differentials.md",
+            "Zeros and Not Defined" => "design/zeros.md",
         ],
         "API" => "api.md",
     ],
diff --git a/docs/src/design/zeros.md b/docs/src/design/zeros.md
new file mode 100644
index 000000000..4be7a4302
--- /dev/null
+++ b/docs/src/design/zeros.md
@@ -0,0 +1,75 @@
+# Design Notes: The many kinds of Zeros and NotDefined's
+
+There are many zero-like and not-defined-like situations one might want to talk about in the context of differentiation.
+Not all of them can be generated by autodiff software.
+Not all of them are supported by ChainRules.
+
+
+Here is a list of some examples:
+
+Differentials are roughly a vector space -- they support scaling.
+So there is at least one scalar zero that must be supported.
+But one might want an extra zero that can be resolved at compile time, based on type, to completely avoid certain computations, e.g. `unthunk`ing `Thunk`s.
+(Or one might want to bake that into the notion of `*(x, t::AbstractThunk) = iszero(x) ? ...`).
+
+Which brings us to a second zero:
+the zero that represents the output of a scalar zero times a thunk, and that avoids unthunking.
+
+
+There is the zero that is `f'(5)` for ``f(x) = (x-5)^2``: a good, clear zero.
+
+There is the not-definedness of the solution to `f'(5)` for ``f(x) = abs(x-5)``,
+where the limit from the left is not equal to the limit from the right,
+but where the range of values enclosed by those limits includes zero,
+in this case ``\lim_{x\to5^{-}}f'(x)\le0\le\lim_{x\to5^{+}}f'(x)``.
+This is an interesting zero/not-defined because it matters for purposes of optimization.
+It is the location of a local minimum.
+`relu` and `x->clamp(x, a, b)` are other functions with this kind of zero/not-definedness.
+See [Subgradient](https://en.wikipedia.org/wiki/Subgradient_method) for more on that.
+
+Conversely, there is the not-definedness of the solution to `f'(5)` for ``f(x)=\begin{cases}
+2(x-5) & x\le5\\
+3(x-5) & x\ge5
+\end{cases}``
+ which is not interesting, because it can't be a local minimum.
+
+
+There is the not-definedness of the solution to `f'(5)` for ``f(x)=\dfrac{(x-5)^4}{(x-5)^2}``,
+where there is a removable point discontinuity, but the limit from each side is zero, and thus it is the location of a local minimum (or each side of it is, if you like).
+And there is the less interesting case where it is nonzero on each side.
+And this can be stacked with the differing-limit cases mentioned earlier, so that the primal function is not defined and the limits from each side do not agree, but either enclose or do not enclose zero.
+
+
+There is the zero that is ``\dfrac{\partial f}{\partial a}`` for ``f(a,b)=2b``.
+This one is particularly important, I feel, in source-to-source AD.
+It represents a disconnection in the computational graph: there is no path from the input ``a`` to the output ``f(a,b)``.
+This one can also show up dynamically, but perhaps that should be considered a different case.
+For example in `max(a,b)` or in `ifelse(cond, a, b)`.
+I have had a few talks with [James Bradbury](https://github.com/jekbradbury) about this; apparently it is important that this is a strong zero, like Julia's `false`, where `false * NaN` is `0.0`, not `NaN`.
+This is the subject of TensorFlow's _double where trick_ (`where` is what `ifelse` is called in TensorFlow), as they do not have a strong zero.
+If a gradient being propagated backwards from a branch that was not taken is `NaN`, and thus the `ifelse` has this disconnected zero, then when the chain rule is applied it is required that this zero remains zero (not `NaN`).
+I have not seen a good writeup on this; apparently one exists somewhere in the TensorFlow issue tracker.
+
+There is the zero/not-definedness for something where perturbing its value is an error.
+So this is the gradient `f'(5)` for `f(x) = [1,2,3,4][x]`.
+A small perturbation to this is an error, e.g. `f(5.1)` is not defined.
+Related to that is the case where the notion of perturbing is not defined.
+This is the case for inputs that are `String`s or `Symbol`s.
+
+There is the case of a structural zero in a sparse data structure.
+Like the off-diagonal of a `Diagonal` matrix.
+Also the structural zero of a `SparseMatrixCSC` that varies at run time, particularly relevant in that it can be the result of the derivative of `getindex`.
+As well as the zero that could be within the differential representing a tuple:
+if it is `f(x::Tuple{Float64,Float64,Float64}) = x[1] + x[3]`
+then the derivative is `Composite{Tuple}(1, Zero(), 1)` and that is a structural zero.
+
+Derivatives with respect to empty things.
+They have no value so cannot be perturbed.
+For example the gradient with respect to an empty array or tuple.
+Also with respect to a struct that has no fields.
+The struct case is interesting as a struct without fields is a singleton
+(technically a `mutable struct` isn't but it might as well be).
+It is the only element of its type.
+A very common case of this is functions.
+Every function in Julia is a singleton struct, with call overloading.
+This is ChainRules's `Δself` that shows up in pullbacks and pushforwards -- it is this kind of zero whenever