Statistical learning frames models as distributions over data and latent variables, allowing models to address a broad array of downstream tasks. The underlying methodology of latent variable models is typically Bayesian.
A central problem involves modeling complex datasets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation remain analytically or computationally tractable.
* overview
* probability and statistics
* bayesian machine learning
* bayesian deep learning
* theory
* graphical models
* non-parametric models
* expectation maximization
* variational inference
* sampling
* likelihood-free inference
* causal inference
* interesting quotes
* interesting papers
[overview]
Zoubin Ghahramani - "Future Directions in Probabilistic Machine Learning" - https://youtube.com/watch?v=Y3obG7F1crw
Zoubin Ghahramani - "Probabilistic Machine Learning - Foundations and Frontiers" -
http://research.microsoft.com/apps/video/default.aspx?id=259579 + https://gridworld.wordpress.com/2015/12/08/nips-2015-posner-lecture-zoubin-ghahramani/
Dmitry Vetrov - "Introduction to Bayesian Framework" (in russian) - https://youtu.be/lkh7bLUc30g?t=24m25s
Boris Yangel - "Probabilistic Programming" (in russian) - http://youtube.com/watch?v=ZHERrzVDTiU
Neil Lawrence - "Deep Learning: Efficiency is the Driver of Uncertainty" - http://inverseprobability.com/2016/03/04/deep-learning-and-uncertainty
Jacob Andreas - "A neural network is a monference, not a model" - http://blog.jacobandreas.net/monference.html
Josh Tenenbaum - "Development of Intelligence: Bayesian Inference" - http://youtube.com/watch?v=icEdI0AIOlU
Marcus Hutter - "Foundations of Machine Learning and Universal Artificial Intelligence" (general bayesian approach to machine learning) -
http://videolectures.net/ssll09_hutter_uai/ + http://videolectures.net/mlss08au_hutter_fund/
E.T. Jaynes - "How Does the Brain Do Plausible Reasoning" - http://bayes.wustl.edu/etj/articles/brain.pdf
[probability and statistics]
"Probability is the representation of uncertain or partial knowledge about the truth of statements."
"Logical inference is about what is certain to be true. Statistical inference is about what is likely to be true."
"How do you extend classical logic to reason with uncertain propositions? Suppose we agree to represent degrees of plausibility with real numbers, larger numbers indicating greater plausibility. If we also agree to a few axioms to quantify what we mean by consistency and common sense, there is a unique and inevitable system for plausible reasoning that satisfies the axioms, which is probability theory. And this has been proven over 60 years ago. The important implication is that all other systems of plausible reasoning - fuzzy logic, neural networks, artificial intelligence, etc. - must either lead to the same conclusions as probability theory, or violate one of the axioms used to derive probability theory."
http://johndcook.com/blog/probability-modeling/
http://johndcook.com/blog/2012/04/19/random-is-as-random-does/
Chris Bishop - http://blogs.technet.com/b/machinelearning/archive/2014/10/22/embracing-uncertainty-the-role-of-probabilities.aspx
http://blogs.technet.com/b/machinelearning/archive/2014/10/30/embracing-uncertainty-probabilistic-inference.aspx
"Data Analysis Recipes: Probability Calculus for Inference" - https://arxiv.org/abs/1205.4446
http://johndcook.com/blog/2008/03/19/plausible-reasoning/
http://zinkov.com/posts/2015-06-09-where-priors-come-from/
https://quora.com/What-is-the-most-important-mathematical-concept-in-statistics
"Probabilistic Modelling" by Iain Murray -
https://youtube.com/watch?v=pOtvyVYAuW4 + https://youtube.com/watch?v=khagz6yWL9w + https://youtube.com/watch?v=U1IAMQYZjfw
http://www.brera.mi.astro.it/~andreon/inference/Inference.html
"Information Theory for Machine Learning" - https://github.com/mtomassoli/papers/blob/master/inftheory.pdf
http://matthias.vallentin.net/probability-and-statistics-cookbook/cookbook-en.pdf
selected papers and books on probability and statistics - https://dropbox.com/sh/ff6xkunvb9emlc1/AAA3SCZx5kvdr1BlYq9ArEaka
course by Joe Blitzstein - https://youtube.com/playlist?list=PLCzY7wK5FzzPANgnZq5pIT3FOomCT1s36
"Bayesian Statistics" - https://coursera.org/learn/bayesian/
"Математическая статистика" - https://compscicenter.ru/courses/math-stat/2015-spring/ (in russian)
definitions of probability by Kolmogorov and Jaynes - https://youtube.com/watch?v=Ihud7yG2iKs (in russian)
definition of randomness in algorithmic information theory - http://youtube.com/watch?v=X0Lo5IWLjko (in russian)
[bayesian machine learning]
"In Bayesian machine learning, all learning follows from two rules of probability: sum rule and product rule."
"From a Bayesian point of view, we should be integrating over likelihoods instead of using optimization methods to select a point estimate of model parameters, usually with ad hoc regularization tuned by cross validation."
"Bayesian reasoning provides a powerful approach for information integration, inference and decision making that has established it as the key tool for data-efficient learning, uncertainty quantification and robust model composition."
https://slackprop.wordpress.com/2016/08/28/the-three-faces-of-bayes/
http://fastml.com/bayesian-machine-learning/
https://reddit.com/r/MachineLearning/comments/3x470a/why_are_bayesian_methods_considered_more_elegant/
difference between bayesian and frequentist expected loss - https://en.wikipedia.org/wiki/Loss_function#Expected_loss
Nando de Freitas - "Bayesian Learning" -
https://youtube.com/watch?v=7192wm3NWSY + https://youtube.com/watch?v=hhKFa12y0Iw
https://youtube.com/watch?v=Fae0j1WN1zA + https://youtube.com/watch?v=2KXoC6Dxhxs
Dmitry Vetrov - "Introduction to Bayesian Framework" (in russian) - https://youtube.com/watch?v=ftlbxFypW74 + https://youtu.be/lkh7bLUc30g?t=24m25s
Dmitry Vetrov - "Introduction to Bayesian Machine Learning" (in russian) - http://youtube.com/watch?v=sZxE-BrSMAE
Dmitry Vetrov - "Bayesian Inference and Latent Variable Models in Machine Learning" - http://youtube.com/watch?v=p08Yh1OHkqk + http://youtube.com/watch?v=okL04cuP2mo
Chris Bishop - "Introduction to Bayesian Inference" - http://videolectures.net/mlss09uk_bishop_ibi/
Zoubin Ghahramani - "Bayesian Inference" -
https://youtube.com/watch?v=kjo9Y_Vrgn4 + https://youtube.com/watch?v=yzNbaAPKXA8 + https://youtube.com/watch?v=H7AMB0oo__4
http://webdav.tuebingen.mpg.de/mlss2013/2015/slides/ghahramani/lect1bayes.pdf
http://webdav.tuebingen.mpg.de/mlss2013/2015/slides/ghahramani/gp-neural-nets15.pdf
http://webdav.tuebingen.mpg.de/mlss2013/2015/slides/ghahramani/mlss15future.pdf
Bob Carpenter - "Bayesian Inference and MCMC" - https://youtube.com/watch?v=qQFF4tPgeWI
Radford Neal - "Bayesian Methods for Machine Learning" - http://www.cs.toronto.edu/~radford/ftp/bayes-tut.pdf
Michael Tipping - "Bayesian Inference: An Introduction to Principles and Practice in Machine Learning" - http://miketipping.com/papers/met-mlbayes.pdf
notes - http://frnsys.com/ai_notes/machine_learning/bayesian_learning.html
course by Dmitry Vetrov (in russian) - https://youtube.com/playlist?list=PLlb7e2G7aSpR8mbaShVBods-hGaFGifkl
course by Dmitry Vetrov and Dmitry Kropotov (in russian) - http://machinelearning.ru/wiki/images/e/e1/BayesML-2007-textbook-1.pdf + http://machinelearning.ru/wiki/images/4/43/BayesML-2007-textbook-2.pdf
http://bayesian-inference.com/bayesian
http://metacademy.org/roadmaps/rgrosse/bayesian_machine_learning
http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Chapter1.ipynb
https://plot.ly/ipython-notebooks/computational-bayesian-analysis/
"A reading list on Bayesian methods" - https://cocosci.berkeley.edu/tom/bayes.html
Michael Jordan - "Bayesian or Frequentist, Which Are You?" - http://videolectures.net/mlss09uk_jordan_bfway/
Zoubin Ghahramani - "Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?" - http://videolectures.net/bark08_ghahramani_samlbb/
probabilistic programming - https://github.com/brylevkirill/notes/blob/master/Probabilistic%20Programming.md
[bayesian deep learning]
Shakir Mohamed - "Bayesian Reasoning and Deep Learning in Agent-based Systems"
https://youtube.com/watch?v=AggqBRdz6C
Shakir Mohamed - "Principles and Applications of Deep Generative Models"
http://videolectures.net/deeplearning2016_mohamed_generative_models/
Shakir Mohamed - "Bayesian Reasoning and Deep Learning"
http://blog.shakirm.com/wp-content/uploads/2015/10/Bayes_Deep.pdf
Dmitry Vetrov - "Bayesian Inference and Deep Learning" (in russian)
https://youtu.be/kFe5zSkro0E?t=17m16s
https://youtu.be/_qrHcSdQ2J4?t=22m31s
https://youtube.com/watch?v=BKh7nj5SmnI
https://youtu.be/0q5p7xP4cdA?t=5h4m9s
Shakir Mohamed - "A Statistical View of Deep Learning":
"Recursive GLMs" - http://blog.shakirm.com/2015/01/a-statistical-view-of-deep-learning-i-recursive-glms/
"Auto-encoders and Free Energy" - http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/
"Memory and Kernels" - http://blog.shakirm.com/2015/04/a-statistical-view-of-deep-learning-iii-memory-and-kernels/
"Recurrent Nets and Dynamical Systems" - http://blog.shakirm.com/2015/05/a-statistical-view-of-deep-learning-iv-recurrent-nets-and-dynamical-systems/
"Generalisation and Regularisation" - http://blog.shakirm.com/2015/05/a-statistical-view-of-deep-learning-v-generalisation-and-regularisation/
"What is Deep?" - http://blog.shakirm.com/2015/06/a-statistical-view-of-deep-learning-vi-what-is-deep/
Yarin Gal - "What my deep model doesn't know" - http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html
Yarin Gal - "Uncertainty In Deep Learning" - http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
Zoubin Ghahramani - "A History of Bayesian Neural Networks" - http://bayesiandeeplearning.org/slides/nips16bayesdeep.pdf
David MacKay - "Course on Information Theory, Pattern Recognition, and Neural Networks" - http://videolectures.net/course_information_theory_pattern_recognition/
David MacKay - "Bayesian Methods for Adaptive Models" - http://www.inference.phy.cam.ac.uk/mackay/thesis.pdf
[books]
Chris Bishop - "Pattern Recognition and Machine Learning" - https://dropbox.com/s/pwtiuqs27lblvjz/Bishop%20-%20Pattern%20Recognition%20and%20Machine%20Learning.pdf
Kevin Murphy - "Machine Learning - A Probabilistic Perspective" - https://dropbox.com/s/jdly520i5irx1h6/Murphy%20-%20Machine%20Learning%20-%20A%20Probabilistic%20Perspective.pdf
Daphne Koller, Nir Friedman - "Probabilistic Graphical Models: Principles and Techniques" - https://dropbox.com/s/cc3mafx3wp0ad1t/Daphne%20Koller%20and%20Nir%20Friedman%20-%20Probabilistic%20Graphical%20Models%20-%20Principles%20and%20Techniques.pdf
David Barber - "Bayesian Reasoning and Machine Learning" - http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online
David MacKay - "Information Theory, Inference and Learning Algorithms" - http://www.inference.phy.cam.ac.uk/mackay/itila/book.html
Art Owen - "Monte Carlo theory, methods and examples" - http://statweb.stanford.edu/~owen/mc/
E.T. Jaynes - "Probability Theory: The Logic of Science" - https://dropbox.com/s/pt5tpm9i5wofbl5/Jaynes%20-%20Probability%20Theory%20-%20The%20Logic%20of%20Science.pdf
Judea Pearl - "Probabilistic Reasoning in Intelligent Systems"
Judea Pearl - "Causality: Models, Reasoning, and Inference"
David Danks - "Unifying the Mind" - http://mitpress.mit.edu/books/unifying-mind
[theory]
"Model-based statistics assumes that the observed data has been produced from a random distribution or probability model. The model usually involves some unknown parameters. Statistical inference aims to learn the parameters from the data. This might be an end in itself - if the parameters have interesting real world implications we wish to learn - or as part of a larger workflow such as prediction or decision making. Classical approaches to statistical inference are based on the probability (or probability density) of the observed data y0 given particular parameter values θ. This is known as the likelihood function, π(y0|θ). Since y0 is fixed this is a function of θ and so can be written L(θ). Approaches to inference involve optimising this, used in maximum likelihood methods, or exploring it, used in Bayesian methods.
A crucial implicit assumption of both approaches is that it’s possible and computationally inexpensive to numerically evaluate the likelihood function. As computing power has increased over the last few decades, there are an increasing number of interesting situations for which this assumption doesn’t hold. Instead models are available from which data can be simulated, but where the likelihood function is intractable, in that it cannot be numerically evaluated in a practical time.
In the Bayesian approach to inference, a probability distribution must be specified on the unknown parameters, usually through a density π(θ). This represents prior beliefs about the parameters before any data is observed. The aim is to learn the posterior beliefs resulting from updating the prior to incorporate the observations. Mathematically this is an application of conditional probability using Bayes' theorem: the posterior is π(θ|y0) = kπ(θ)L(θ), where k is a constant of proportionality that is typically hard to calculate. A central aim of Bayesian inference is to produce methods which approximate useful properties of the posterior in a reasonable time."
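The update π(θ|y0) = kπ(θ)L(θ) can be illustrated with a grid approximation. A minimal sketch in Python/numpy, assuming a hypothetical beta-binomial coin-flip setup (not from the quoted text): on a discrete grid the hard-to-calculate constant k reduces to a sum.

    import numpy as np

    # hypothetical example: infer a coin's bias theta from 7 heads in 10 flips
    theta = np.linspace(0.001, 0.999, 999)          # grid over the parameter
    prior = np.full_like(theta, 1.0 / len(theta))   # flat prior pi(theta)

    heads, tails = 7, 3                             # observed data y0
    likelihood = theta**heads * (1 - theta)**tails  # L(theta) = pi(y0|theta)

    unnormalized = prior * likelihood               # pi(theta) * L(theta)
    posterior = unnormalized / unnormalized.sum()   # the grid sum plays the role of 1/k

    print("posterior mean:", (theta * posterior).sum())  # ~0.667, the Beta(8,4) mean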
advantages of bayesian methods
- learn from limited, noisy, missing data
- deal with small sample size
- marginalize over latent variables
- compute error bars
- establish causal relationships
- produce explanations for decisions
- integrate knowledge
applications of bayesian methods
- data-efficient learning
- exploration
- relational learning
- semiparametric learning
- hypothesis formation
- causal reasoning
- macro-actions and planning
- visual concept learning
- world simulation
- scene understanding
future directions of bayesian methods
- probabilistic programming languages
- bayesian optimization
- rational allocation of computational resources
- efficient data compression
- automating model discovery and experimental design
bayesian inference
A major difference between frequentist and Bayesian approaches in machine learning practice: a frequentist approach would produce a point estimate θ^ from data and predict with p(x|θ^). In contrast, the Bayesian approach needs to integrate over different θs. In general, this integration is intractable, and hence Bayesian machine learning has focused on either finding special distributions for which the integration is tractable or finding efficient approximations.
- very easy to add new evidence progressively, the old result becomes the new prior
- one can seamlessly use the results in decision making
- results are not confusing - most people cannot tell what frequentist P=.06 means
- confidence intervals mean what one thinks they mean
- seamlessly avoid the bias variance problem that plagues frequentist statistics
- does not throw away information - uses all the data one throws at it
- many if not most frequentist methods are actually equivalent to a Bayesian result with a covert prior
- one cannot actually get away from having a view on the prior
- bayesian inference (dynamic programming, variational Bayes, Markov Chain Monte Carlo)
- approximate inference (belief propagation and variational approximations)
- MAP/ML estimation (Expectation Maximization, conjugate and projected gradient methods)
* MAP estimates can be computed in several ways:
- analytically, when the mode(s) of the posterior distribution can be given in closed form. This is the case when conjugate priors are used.
- via numerical optimization such as the conjugate gradient method or Newton's method. This usually requires first or second derivatives, which have to be evaluated analytically or numerically.
- via modification of an expectation-maximization algorithm. This does not require derivatives of the posterior density.
- via Monte Carlo method using simulated annealing
* posterior mean or median with credible intervals
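As a concrete illustration of the second route above (numerical optimization), here is a minimal sketch in Python/numpy with scipy; the model choice (Gaussian likelihood for an unknown mean, Gaussian prior) is a hypothetical example, not anything prescribed above:

    import numpy as np
    from scipy.optimize import minimize

    data = np.array([1.2, 0.8, 1.5, 1.1])   # hypothetical observations, sigma = 1

    def neg_log_posterior(params):
        theta = params[0]
        nll = 0.5 * np.sum((data - theta) ** 2)   # -log likelihood up to a constant
        nlp = 0.5 * theta ** 2 / 10.0             # -log prior, theta ~ N(0, 10)
        return nll + nlp

    # gradient-based optimizer; conjugate gradient or Newton variants also apply
    result = minimize(neg_log_posterior, x0=np.array([0.0]))
    print("MAP estimate:", result.x)  # shrunk from the sample mean toward the prior mean 0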
variational inference: fast convergence, but the estimates are fundamentally limited in accuracy by the choice of variational approximation. (KLqp, KLpq, MAP/Laplace)
MCMC inference: slow convergence, but estimates can become arbitrarily close to the true posterior given long enough computation time. (Metropolis-Hastings, Hamiltonian Monte-Carlo, Stochastic Gradient Langevin Dynamics)
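A minimal random-walk Metropolis-Hastings sketch (Python/numpy; the one-dimensional target is a hypothetical example known only up to its normalizing constant) shows why MCMC needs only unnormalized densities:

    import numpy as np

    rng = np.random.default_rng(0)

    def log_unnormalized_posterior(theta):
        # hypothetical target: N(1.15, 0.25) with the constant dropped
        return -0.5 * (theta - 1.15) ** 2 / 0.25

    samples, theta = [], 0.0
    for _ in range(10000):
        proposal = theta + rng.normal(scale=0.5)   # symmetric random-walk proposal
        log_accept = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta)
        if np.log(rng.uniform()) < log_accept:     # Metropolis acceptance rule
            theta = proposal
        samples.append(theta)

    print("posterior mean estimate:", np.mean(samples[2000:]))  # burn-in discarded, ~1.15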
myths and misconceptions about bayesian methods
- bayesian methods make assumptions where other methods don't (all methods make assumptions otherwise it's impossible to predict, but bayesian methods are transparent in their assumptions whereas other methods are often opaque)
- if you don't have the right prior you won't do well (no such thing as the right prior, choose vague priors such as nonparametrics when in doubt)
- Maximum A Posteriori (MAP) is a bayesian method (MAP is similar to regularization, the key in bayesian methods is to average over uncertain variables and parameters rather than to optimize)
- bayesian methods don't have theoretical guarantees (frequentist style generalization error bounds such as PAC-Bayes can be applied, it is often possible to prove convergence, consistency and rates)
- bayesian methods are generative (can be used for both generative and discriminative learning such as in gaussian process classification)
- bayesian methods don't scale well (variational inference and MCMC can scale to very large datasets but averaging/integration is indeed more expensive than optimization)
- as the data set grows infinitely, Bayes converges to maximum likelihood, the prior washes out, integration becomes unnecessary (this assumes we want to learn a fixed simple model from infinitely many iid data points, while big data is more like a large collection of little data sets with shared structure to be learned - e.g. per-user recommendations with millions of users or hundreds of thousands of items, where there are few data points per user/item and none for new ones - so we would really like models in which the number of parameters grows with the size of the data set, as in nonparametric models, and hierarchical models such as a combination of models for particular users with a model for people in general)
challenges:
- inference in probabilistic programming systems and broad model families:
Gradient-based methods for parameter estimation, variational inference
Metropolis-Hastings variants with efficient rescoring
Message passing variants
Sequential Monte Carlo variants
- learning to infer using discriminative methods to amortize probabilistic inference:
Variational Autoencoders
Deep Latent Gaussian Models
Restricted Boltzmann Machines
Neural Network based MCMC proposals
(E. T. Jaynes) "The traditional ‘frequentist’ methods which use only sampling distributions are usable and useful in many particularly simple, idealized problems; however, they represent the most proscribed special cases of probability theory, because they presuppose conditions (independent repetitions of a ‘random experiment’ but no relevant prior information) that are hardly ever met in real problems. This approach is quite inadequate for the current needs of science. In addition, frequentist methods provide no technical means to eliminate nuisance parameters or to take prior information into account, no way even to use all the information in the data when sufficient or ancillary statistics do not exist. Lacking the necessary theoretical principles, they force one to ‘choose a statistic’ from intuition rather than from probability theory, and then to invent ad hoc devices (such as unbiased estimators, confidence intervals, tail-area significance tests) not contained in the rules of probability theory. Each of these is usable within the small domain for which it was invented but, as Cox’s theorems guarantee, such arbitrary devices always generate inconsistencies or absurd results when applied to extreme cases.
All of these defects are corrected by use of Bayesian methods, which are adequate for what we might call ‘well-developed’ problems of inference. As Jeffreys demonstrated, they have a superb analytical apparatus, able to deal effortlessly with the technical problems on which frequentist methods fail. They determine the optimal estimators and algorithms automatically, while taking into account prior information and making proper allowance for nuisance parameters, and, being exact, they do not break down – but continue to yield reasonable results – in extreme cases. Therefore they enable us to solve problems of far greater complexity than can be discussed at all in frequentist terms. All this capability is contained already in the simple product and sum rules of probability theory interpreted as extended logic, with no need for – indeed, no room for – any ad hoc devices.
Before Bayesian methods can be used, a problem must be developed beyond the ‘exploratory phase’ to the point where it has enough structure to determine all the needed apparatus (a model, sample space, hypothesis space, prior probabilities, sampling distribution). Almost all scientific problems pass through an initial exploratory phase in which we have need for inference, but the frequentist assumptions are invalid and the Bayesian apparatus is not yet available. Indeed, some of them never evolve out of the exploratory phase. Problems at this level call for more primitive means of assigning probabilities directly out of our incomplete information. For this purpose, the Principle of maximum entropy has at present the clearest theoretical justification and is the most highly developed computationally, with an analytical apparatus as powerful and versatile as the Bayesian one. To apply it we must define a sample space, but do not need any model or sampling distribution. In effect, entropy maximization creates a model for us out of our data, which proves to be optimal by so many different criteria that it is hard to imagine circumstances where one would not want to use it in a problem where we have a sample space but no model."
[graphical models]
- combine probability theory with graphs
- new insights into existing models
- framework for designing new models
- graph-based algorithms for calculation and computation
- efficient software implementation
when we have noisy data and uncertainty
when we have lots of prior knowledge
when we wish to reason about multiple variables
when we want to construct richly structured models from modular building blocks
* directed graphs to specify the model
* factor graphs for inference and learning
The biggest advantage of graphical models is a relatively simple way to distinguish conditionally independent variables, which simplifies further analysis and can significantly lower the number of factors a given variable depends on.
Graphical models/factor graphs are the formalism of choice for probabilistically coherent reasoning about situations. Where you have information, you can naturally build it in, in the form of potentials/factors/observed random variables. Where you have unobserved relationships you can often model them with latent (unobserved) variables. A variety of techniques for learning and inference in the presence of latent variables exist.
Training is more complex in a directed model, because the model parameters are constrained to be probabilities - constraints which can make the optimization problem more difficult. This is in stark contrast to the joint likelihood, which is much easier to compute for directed models than undirected models.
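A toy factorization (Python/numpy; the CPTs for a binary chain A -> B -> C are hypothetical numbers) makes the conditional-independence point concrete: the directed structure alone guarantees that C is independent of A given B.

    import numpy as np

    p_a = np.array([0.6, 0.4])            # P(A)
    p_b_given_a = np.array([[0.7, 0.3],   # P(B|A), rows indexed by A
                            [0.2, 0.8]])
    p_c_given_b = np.array([[0.9, 0.1],   # P(C|B), rows indexed by B
                            [0.4, 0.6]])

    # the joint factorizes as P(A,B,C) = P(A) P(B|A) P(C|B)
    joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]

    # P(C|A,B) does not depend on A: C is conditionally independent of A given B
    p_c_given_ab = joint / joint.sum(axis=2, keepdims=True)
    print(np.allclose(p_c_given_ab[0], p_c_given_ab[1]))   # True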
bayesian networks
There are cases where supervised learning is not applicable: when there is not one target variable of interest but many, or when in each data point different variables might be available or missing.
Typical example: a medical domain with many kinds of diseases, symptoms, and context information: for a given patient little is known, and one is interested in the prediction of many possible diseases and procedures.
- encode dependencies among all variables, and hence readily handle situations where some data entries are missing
- can be used to learn causal relationships, and hence to gain understanding of a problem domain and to predict the consequences of intervention
- have both a causal and a probabilistic semantics, and hence are an ideal representation for combining prior knowledge (which often comes in causal form) with data
- (like other bayesian statistical methods) offer an efficient and principled approach for avoiding overfitting of the data
http://deeplearningbook.org/contents/graphical_models.html
introduction by Dmitry Vetrov (in russian)
http://youtube.com/watch?v=D_dNxrIazco
http://youtube.com/watch?v=q-dpXbp16Lk
introduction by Chris Bishop
http://youtube.com/watch?v=ju1Grt2hdko
http://youtube.com/watch?v=c0AWH5UFyOk
http://youtube.com/watch?v=QJSEQeH40hM
introduction by Nando de Freitas
"Bayesian Networks" - https://youtube.com/watch?v=KJMJl1SWLIo + https://youtube.com/watch?v=XgP2hmf7X4U + https://youtube.com/watch?v=Xhdpk9HZQuo
"Hidden Markov Models" - https://youtube.com/watch?v=jY2E6ExLxaw
course by Daphne Koller
"Probabilistic Graphical Models" course - https://coursera.org/course/pgm + https://youtube.com/playlist?list=PL50E6E80E8525B59C
course by Christoph Lampert
"Learning with Structured Data" - https://youtube.com/watch?v=uAsys22y5mY&list=PLEqoHzpnmTfA0wc1JxjoVVOrJlx8W0rGf
course by Pedro Domingos
"Statistical Learning" course - https://class.coursera.org/machlearning-001/
course by Alex Smola
"Directed Graphical Models" - http://youtube.com/watch?v=W6XyXeB3Cko + http://youtube.com/watch?v=0sYVPHrz9mc
"Undirected Graphical Models" - http://youtube.com/watch?v=X3JudqgiffM
Michael I. Jordan - "Introduction to Graphical Models" - https://goo.gl/hctNE5
David Heckerman - "A Tutorial on Learning With Bayesian Networks" -
http://research.microsoft.com/en-us/um/people/heckerman/tutorial.pdf
advanced course by Michael Chertkov - https://youtube.com/watch?v=jGrgCd4U0sU&list=PLIvQImOQgbGaJrE8G-ZffKgPlj6k0dqbO
https://theneural.wordpress.com/2011/07/17/undirected-models-are-better-at-sampling/ (Ilya Sutskever)
[non-parametric models]
The advantages of Bayesian nonparametrics are largely those of Bayesian methods as a whole - interpretability, with incredibly intuitive ways to quantify uncertainty - plus the extension to infinite-dimensional parameter spaces. The former allows one to form significance tests and continue to pose all sorts of interesting questions, rather than stopping at a discriminative model; the latter makes the model theoretically better justified than a finite space and thus more promising, as it is not as reliant on feature engineering.
The basic point of GPs is that they provide a prior distribution on real-valued functions. This lets you do regression as Bayesian inference: given observed data, Bayes' rule turns your prior on functions into a posterior distribution. Having a posterior distribution on functions, rather than just a single learned function, means you can reason about uncertainty of your predictions at any set of points in terms of means and (co)variances. Generally this means you'll make very confident predictions in regions where you have a lot of training data, becoming less confident as you move away from the training data. And you can use those uncertainty estimates to make decisions, e.g. collecting new data in the regions where your current beliefs are the most uncertain.
Gaussian Processes rather than fitting, say, a two-parameter line or four-parameter cubic curve, actually fit an infinite-dimensional model to data. They accomplish this by judicious use of certain priors on the model, along with a so-called "kernel trick" which solves the infinite dimensional regression implicitly using a finite-dimensional representation constructed based on these priors.
Gaussian processes are somewhat similar to Support Vector Machines - both use kernels and have similar scalability (which has been vastly improved throughout the years by using approximations). A natural formulation for GP is regression, with classification as an afterthought. For SVM it’s the other way around. Another difference is that GP are probabilistic from the ground up (providing error bars), while SVM are not. You can observe this in regression. Most “normal” methods only provide point estimates. Bayesian counterparts, like Gaussian processes, also output uncertainty estimates.
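A bare-bones GP regression sketch (Python/numpy; the RBF kernel, training points and noise level are illustrative assumptions) shows the posterior mean and the error bars growing away from the data:

    import numpy as np

    def rbf(x1, x2, lengthscale=1.0):
        # squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))
        return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

    x_train = np.array([-2.0, -0.5, 1.0, 2.5])   # hypothetical observations
    y_train = np.sin(x_train)
    x_test = np.linspace(-3, 3, 7)
    noise = 1e-2

    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    K_ss = rbf(x_test, x_test)

    mean = K_s.T @ np.linalg.solve(K, y_train)    # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)  # posterior covariance
    print(np.round(mean, 2))
    print(np.round(np.sqrt(np.diag(cov)), 2))     # uncertainty grows away from the data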
"Yes, You Can Fit Models With More Parameters Than Data Points" - https://jakevdp.github.io/blog/2015/07/06/model-complexity-myth/
https://reddit.com/r/MachineLearning/comments/3zwlpm/eli5_gaussian_processes/
Zoubin Ghahramani - https://youtu.be/H7AMB0oo__4?t=21m51s + http://webdav.tuebingen.mpg.de/mlss2013/2015/slides/ghahramani/gp-neural-nets15.pdf
Nando de Freitas - https://youtube.com/watch?v=4vGiHC35j9s + https://youtube.com/watch?v=MfHKW5z-OOA
Tamara Broderick, Michael Jordan - "Nonparametric Bayesian Methods: Models, Algorithms, and Applications" -
https://youtube.com/watch?v=I7bgrZjoRhM + https://youtube.com/watch?v=yfLoxwjCGNY + https://youtube.com/watch?v=2H2n4iUYpZE + https://youtube.com/watch?v=EUUyQbtUXR0
Tamara Broderick - "Bayesian Nonparametrics" -
https://youtube.com/watch?v=kKZkNUvsJ4M + https://youtube.com/watch?v=oPcv8MaESGY + https://youtube.com/watch?v=HpcGlr19vNk
https://youtube.com/watch?v=FUL1DcjOjwo + https://youtube.com/watch?v=8duQxlppe5Y + https://youtube.com/watch?v=mC-jZcEb7ME
Neil Lawrence - "Gaussian Processes" - https://youtube.com/watch?v=S9RbSCpy_pg + https://youtube.com/watch?v=MxeQIKGEXb8 + https://youtube.com/watch?v=Ead4TivIOmU
Orbanz, Teh - "Bayesian Nonparametric Models" - http://www.stats.ox.ac.uk/~teh/research/npbayes/OrbTeh2010a.pdf
Rasmussen, Williams - "Gaussian Processes for Machine Learning" - http://gaussianprocess.org/gpml/
http://dustintran.com/blog/recurrent-gaussian-processes/
[expectation maximization]
The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an Expectation step, which finds the distribution for the unobserved variables, given the known values for the observed variables and the current estimate of the parameters, and a Maximization step, which re-estimates the parameters to be those with maximum likelihood, under the assumption that the distribution found in E step is correct. It can be shown that each such iteration improves the true likelihood, or leaves it unchanged (if a local maximum has already been reached).
Replace counts with expectations of counts: consider a particular data point l. In the E-step we calculate the marginal probabilities of interest given the known information Xl in that data point and given the current estimates of the parameters θ, using e.g. belief propagation. This yields expected counts. In the M-step we then re-estimate the parameters by maximum likelihood from those expected counts.
The E-step is really an inference step and approximate inference can be used (loopy belief propagation, MCMC, Gibbs, mean-field).
Applicable to problems with missing data, provided the missingness is random rather than informative.
Assumes that the conditional distribution of the hidden variables given the observed ones is easy to compute, and this is not always the case.
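The alternation is easiest to see in code. A minimal EM for a two-component 1-D Gaussian mixture in Python/numpy (the data and initialization are hypothetical):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])  # toy data

    mu = np.array([-1.0, 1.0])     # initial guesses
    sigma = np.array([1.0, 1.0])
    w = np.array([0.5, 0.5])       # mixing weights

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    for _ in range(50):
        # E-step: distribution of the unobserved component label for each point
        r = w * normal_pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximum-likelihood parameters given those responsibilities
        n_k = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
        w = n_k / len(x)

    print(np.round(mu, 2), np.round(w, 2))   # approaches (-2, 3) and (0.4, 0.6)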
https://reddit.com/r/MachineLearning/comments/3wr2qx/how_does_the_em_algorithm_work_for_discriminative/
introduction by Alex Smola - https://youtu.be/PpX6hllPVLs?t=1h1m22s (EM on single slide)
introduction by Dmitry Vetrov (in russian) -
http://youtu.be/U0LylVL-zJM?t=35m59s + http://youtube.com/watch?v=CqjqTbUgbOo (Crisp EM, Variational EM, Stochastic EM)
http://lectoriy.mipt.ru/lecture/DeepHack-L04-150722.03
"EM Algorithm and Variants: an Informal Tutorial" by Alexis Roche - http://arxiv.org/abs/1105.1476
"So I think it's easier to understand EM through the lens of variational inference or the so called 'missing data' formulation. Say you have an observed variable x, parameters θ, and latent variables z. You want to maximize L = log p(x|θ) = log ∫ p(x,z|θ)dz. But you don't know z. To cope we'll define a 'variational distribution' (or 'inference model') q(z).
log ∫ p(x,z|θ)dz = log ∫ (q(z)/q(z))p(x,z|θ)dz ≥ <Jensen's inequality> ∫ q(z)log(p(x,z|θ)/q(z))dz = E[log p(x,z|θ)] + H[q(z)] where the expectation is over q and H[q(z)] is the entropy of the variational distribution. Since H[q(z)] is independent of θ, it can be dropped.
When EM is derived, q(z) is almost always set as q(z) = p(z|x,θ), but this isn't necessary. The above will be true for any distribution over z. Different choices will just vary how tight the lower-bound is.
Whatever we choose, what we have left is known as the 'Q-distribution': Q = E[log p(x,z|θ)], or Q = E[log p(x|θ)] if q(z) = p(z|x,θ), since p(x,z|θ)/q(z) = p(x|θ)(p(z|x,θ)/q(z))
EM is usually written as first computing Q then optimizing wrt θ, but it really can be written in one step, EM update: argmax_θ E[log p(x,z|θ)]
Now if you want to think of p as a conditional classifier over labels y, you start out with L = log p(y|x,θ) = log ∫ p(y,z|θ,x)dz And the rest proceeds as above except conditioned on x."
"The EM algorithm is a class of optimizers specifically taylored to ML problems, which makes it both general and not so general. Perhaps the most salient feature of EM is that it works iteratively by maximizing successive local approximations of the likelihood function. Therefore, each iteration consists of two steps: one that performs the approximation (the E-step) and one that maximizes it (the M-step). But, let's make it clear, not any two-step iterative scheme is an EM algorithm. For instance, Newton and quasi-Newton methods work in a similar iterative fashion but do not have much to do with EM. What essentially defines an EM algorithm is the philosophy underlying the local approximation scheme - which, for instance, doesn't rely on differential calculus.
The key idea underlying EM is to introduce a latent variable Z whose PDF depends on θ with the property that maximizing p(z|θ) is easy or, say, easier than maximizing p(y|θ). Loosely speaking, we somewhat enhance the incomplete data by guessing some useful additional information. Technically, Z can be any variable such that θ -> Z -> Y is a Markov chain, i.e. we assume that p(y|z,θ) is independent from θ: p(z,y|θ) = p(z|θ)p(y|z).
Original EM formulation stems from a very simple variational argument. Under almost no assumption regarding the complete variable Z, except its PDF doesn't vanish to zero, we can bound the variation of the log-likelihood function L(θ) = log p(y|θ) as follows:
L(θ) - L(θ') = log (p(y|θ) / p(y|θ')) = log ∫ (p(z,y|θ) / p(y|θ'))dz = log ∫ (p(z,y|θ) / p(z,y|θ')) p(z|y,θ') dz = [step 1] log ∫ (p(z|θ) / p(z|θ')) p(z|y,θ') dz >= [step 2] Q(θ,θ'), where Q(θ,θ') := ∫ log(p(z|θ) / p(z|θ')) p(z|y,θ') dz
Step 1 results from the fact that p(y|z,θ) is independent of θ, because p(z,y|θ) = p(z|θ)p(y|z). Step 2 follows from Jensen's inequality along with the well-known concavity of the logarithm. Therefore Q(θ,θ') is an auxiliary function for the log-likelihood, in the sense that (i) the likelihood variation from θ' to θ is always greater than Q(θ,θ'), and (ii) Q(θ',θ') = 0. Hence, starting from an initial guess θ', we are guaranteed to increase the likelihood value if we can find a θ such that Q(θ,θ') > 0. Iterating such a process defines an EM algorithm.
There is no general convergence theorem for EM, but thanks to the above mentioned monotonicity property, convergence results may be proved under mild regularity conditions. Typically, convergence towards a non-global likelihood maximizer, or a saddle point, is a worst-case scenario. Still, the only trick behind EM is to exploit the concavity of the logarithm function!"
"For approximating a posterior probability, variational Bayes (not related to variational inference) is an alternative to Monte Carlo sampling methods for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to directly evaluate or sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, Variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior. Variational Bayes can be seen as an extension of the Expectation Maximization algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically. For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to iteratively update the parameters often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables."
[variational inference]
Variational inference is an umbrella term for algorithms which cast Bayesian inference as optimization.
At a high level, probabilistic graphical models have two kinds of variables: visible and hidden. Visible variables are the ones we observe; hidden variables are ones that we use as part of our model to explain relationships between visible variables or describe hidden causes behind the observations. These hidden variables may not correspond to observable quantities. For example, when modelling faces, observable variables might be raw pixel intensities in an image, while hidden variables might describe things like lighting, eye colour, face orientation, skin tone. The hidden variables, and the relationships between variables, correspond to our model of how the world might work.
Generally, we want to be able to do two things with such models:
- inference: which is determining the value (or conditional probability distribution) of hidden variables, given the observations. "Given a particular image with its pixel values, what are probable values of face orientation?"
- learning: adjusting parameters of the model so it fits our dataset better. "How should we find the parameters that are most consistent with our observations?" This is particularly important in the deep learning flavour of probabilistic models, where the relationship between hidden variables might be described by a deep neural network with several layers and millions of parameters.
To solve these two problems, we often need the ability to marginalise, to calculate marginal probability distributions of subsets of the variables. In particular, we often want to calculate (and maximise) the marginal likelihood, or model evidence, which is the probability of observable variables, but with the hidden variables averaged out. Equivalently, one might phrase the learning and inference problems as evaluating normalisation constants or partition functions. Evaluating these quantities generally involves intractable integrals or enumerating and summing over exponentially many possibilities, so exact inference and learning in most models are practically impossible.
One approach is to try to approximate those integrals by sampling, but often we're faced with a distribution we can't even easily obtain unbiased samples from, and we have to use Markov chains, which may take a long time to visit all the places they need to visit for our estimate to be any good.
Variational inference sidesteps the problem of calculating normalisation constants by constructing a lower bound to the marginal likelihood. For that we use an approximate posterior distribution, with a bunch of little knobs inside of it that we can adjust even per data point to make it as close to the real posterior as possible. Note that this optimization problem (of matching one distribution with another approximate one) doesn't involve the original intractable integrals we try to avoid. With some math we can show that this can give a lower bound on the thing we'd like to be maximizing (the probability of the data under our model), and so if we can optimize the parameters of our model with respect to the lower bound, maybe we'll be able to do something useful with respect to the thing we actually care about.
Variational inference is a paradigm where instead of trying to compute exactly the posterior distribution one searches through a parametric family for the closest (in relative entropy) distribution to the true posterior. The key observation is that one can perform stochastic gradient descent for this problem without having to compute the normalization constant in the posterior distribution (which is often an intractable problem). The only catch is that in order to compute the required gradients one needs to be able to sample from the variational posterior (sample an element of the parametric family under consideration conditioned on the observed data), and this might itself be a difficult problem in large-scale applications.
Variational inference provides an optimization-based alternative to the sampling-based Monte Carlo methods, and tends to be more efficient. It involves approximating the exact posterior using a distribution from a more tractable family, often a fully factored one, by maximizing a variational lower bound on the log-likelihood w.r.t. the parameters of the distribution. For a small class of models, using such variational posteriors allows the expectations that specify the parameter updates to be computed analytically. However, for highly expressive models such as the ones we are interested in, these expectations are intractable even with the simplest variational posteriors. This difficulty is usually dealt with by lower bounding the intractable expectations with tractable ones by introducing more variational parameters. However, this technique increases the gap between the bound being optimized and the log-likelihood, potentially resulting in a poorer fit to the data. In general, variational methods tend to be more model-dependent than sampling-based methods, often requiring non-trivial model-specific derivations.
Traditional unbiased inference schemes such as Markov Chain Monte Carlo are often slow to run and difficult to evaluate in finite time. In contrast, variational inference allows for competitive run times and more reliable convergence diagnostics on large-scale and streaming data - while continuing to allow for complex, hierarchical modelling. The recent resurgence of interest in variational methods includes new methods for scalability using stochastic gradient methods, extensions to the streaming variational setting, improved local variational methods, inference in non-linear dynamical systems, principled regularisation in deep neural networks, and inference-based decision making in reinforcement learning, amongst others. Variational methods have clearly emerged as a preferred way to allow for tractable Bayesian inference. Despite this interest, there remain significant trade-offs in speed, accuracy, simplicity, applicability, and learned model complexity between variational inference and other approximative schemes such as MCMC and point estimation.
"Variational Inference: Foundations and Modern Methods" tutorial at NIPS 2016 by David Blei, Rajesh Ranganath, Shakir Mohamed -
https://channel9.msdn.com/Events/Neural-Information-Processing-Systems-Conference/Neural-Information-Processing-Systems-Conference-NIPS-2016/Variational-Inference-Foundations-and-Modern-Methods
http://www.cs.columbia.edu/~blei/talks/2016_NIPS_VI_tutorial.pdf
"Reparametrization trick: Revolution in Stochastic Computational Graphs" by Dmitry Vetrov - https://youtu.be/0q5p7xP4cdA?t=5h3m29s (in russian)
introduction by Zoubin Ghahramani - https://youtu.be/yzNbaAPKXA8?t=19m45s
introduction by Jordan Boyd-Graber - https://youtube.com/watch?v=2pEkWk-LHmU
http://davmre.github.io/inference/2015/11/13/elbo-in-5min/
http://davmre.github.io/inference/2015/11/13/general_purpose_variational_inference/
http://barmaley-exe.github.io/posts/2016-07-01-neural-variational-inference-classical-theory.html
http://barmaley-exe.github.io/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.html
http://barmaley-exe.github.io/posts/2016-07-05-neural-variational-inference-blackbox.html
"Variational Inference for Machine Learning" by Shakir Mohamed - http://shakirm.com/papers/VITutorial.pdf
introduction by Blei, Kucukelbir, McAuliffe - http://arxiv.org/abs/1601.00670
introduction by David Blei - https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
"An Introduction to Variational Methods for Graphical Model" by Jordan, Ghahramani, Jaakkola, Saul - https://www.cs.berkeley.edu/~jordan/papers/variational-intro.pdf
introduction by Dmitry Vetrov (in russian) - https://dropbox.com/s/t3iyeu78o0c86kq/Dmitry%20Vetrov%20-%20Variational%20Inference.pdf
variational inference
(Ferenc Huszar) "Variational inference is useful for dealing with latent variable models. Let's assume that for each observation x we assign a hidden variable z. Our model pθ describes the joint distribution between x and z. In such a model, typically:
pθ(z) is very easy ( 🐣 )
pθ(x|z) is easy ( 🐹 )
pθ(x,z) is easy ( 🐨 )
pθ(x) is super-hard ( 🐍 )
pθ(z|x) is mega-hard ( 🐲 )
to evaluate. Unfortunately, in machine learning the things we want to calculate are exactly the bad guys, 🐍 and 🐲
- inference is evaluating pθ(z|x) ( 🐲 )
- learning (via maximum likelihood) involves pθ(x) ( 🐍 )
Variational lower bounds give us ways to approximately perform both inference and maximum likelihood parameter learning, by approximating the posterior 🐲 with a simpler, tamer distribution, qψ(z|x) ( 🐰 ) called the approximate posterior or recognition model. Variational inference and learning involves maximising the evidence lower bound (ELBO):
ELBO(θ,ψ) = ∑n log p(xn) − KL[qψ(z|xn)∥pθ(z|xn)] or 💪 = ∑n log🐍 - KL[ 🐰 || 🐲 ]
This expression is still full of 🐍s and 🐲s, but the nice thing about it is that it can be written in more convenient forms which only contain the good guys 🐣 🐹 🐨 🐰:
💪 = − ∑n E🐰 log ( 🐰 / 🐨 ) + constant = ∑n ( E🐰 log 🐹 - KL[ 🐰 || 🐣 ] )
Both expressions only contain nice, tame distributions and do not need explicit evaluation of either the marginal likelihood 🐍 or the posterior 🐲.
ELBO is - as the name suggests - a lower bound to the model evidence or log likelihood. Therefore, maximising it with respect to θ and ψ approximates maximum likelihood learning, while you can use the recognition model 🐰 instead of 🐲 to perform tractable approximate inference."
Evidence Lower Bound (variational lower bound):
log pθ(x) = L(θ,φ,x) + Dkl(qφ(z|x) || pθ(z|x)), where
L(θ,φ,x) := ∫ qφ(z|x)log(pθ(x,z)/qφ(z|x))dz (ELBO or, in case of unnormalized factors in qφ(z|x), variational free energy)
Dkl(qφ(z|x) || pθ(z|x)) = ∫ qφ(z|x)log(qφ(z|x)/pθ(z|x))dz
Dkl(qφ(z|x) || pθ(z|x)) >=0 => L(θ,φ,x) <= log pθ(x) and L(θ,φ,x) is variational lower bound on the log-likelihood/evidence (ELBO)
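The identity above can be checked numerically. A tiny discrete model (Python/numpy; the probability tables are hypothetical numbers) verifies log pθ(x) = L + Dkl exactly, for any choice of q:

    import numpy as np

    p_z = np.array([0.5, 0.3, 0.2])           # prior p(z) over 3 latent states
    p_x_given_z = np.array([0.9, 0.4, 0.1])   # likelihood p(x|z) for one fixed x
    p_xz = p_z * p_x_given_z                  # joint p(x,z)
    log_p_x = np.log(p_xz.sum())              # evidence

    q = np.array([0.6, 0.3, 0.1])             # an arbitrary variational q(z|x)
    elbo = np.sum(q * (np.log(p_xz) - np.log(q)))
    posterior = p_xz / p_xz.sum()
    kl = np.sum(q * (np.log(q) - np.log(posterior)))

    print(np.isclose(log_p_x, elbo + kl))     # True: log p(x) = ELBO + KL(q || p(z|x))
    print(bool(elbo <= log_p_x))              # True: the bound holds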
Likelihood Ratio estimator for the ELBO:
∇φL(θ,φ,x) = ∫ ∇φ( qφ(z|x)(log(pθ(x,z))-log(qφ(z|x))) )dz = ∫ ∇φqφ(z|x)(log(pθ(x,z))-log(qφ(z|x)))dz - ∫ qφ(z|x)∇φlog(qφ(z|x))dz
∫ qφ(z|x)∇φlog(qφ(z|x))dz = ∫ ∇φqφ(z|x)dz = ∇φ∫ qφ(z|x)dz = ∇φ1 = 0
∇φL(θ,φ,x) = ∫ qφ(z|x)∇φlog(qφ(z|x))(log(pθ(x,z)) - log(qφ(z|x)))dz ≈ 1/N Σi ∇φlog(qφ(zi|x))(log(pθ(x,zi)) - log(qφ(zi|x)) + K), zi ~ qφ(z|x)
where K is some constant (optimal baseline) added to reduce the variance of the estimate
this estimator suffers from high variance, both because the expectation is taken with respect to q and because q depends on the very parameters φ being differentiated
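A numerical sketch of this estimator in Python/numpy; the target gradient d/dμ E_{z~N(μ,1)}[z²] = 2μ is a hypothetical stand-in for the ELBO integrand, chosen because the answer is known in closed form:

    import numpy as np

    rng = np.random.default_rng(2)
    mu, n = 1.5, 100000
    z = rng.normal(mu, 1.0, size=n)     # samples from q = N(mu, 1)
    f = z ** 2                          # stand-in for log p(x,z) - log q(z|x)
    score = z - mu                      # grad_mu log N(z|mu,1)

    grad_lr = np.mean(f * score)                  # plain likelihood-ratio estimate
    grad_lr_b = np.mean((f - f.mean()) * score)   # with a baseline, estimated from the
                                                  # same batch (a common shortcut)
    print(grad_lr, grad_lr_b, 2 * mu)   # both ~3.0; the baseline version is less noisy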
Pathwise Derivative estimator for the ELBO:
rewriting the ELBO so that the distribution under the expectation does not depend on φ
expressing z as a sample from an un-parameterized distribution followed by a parameterized deterministic transformation
z = g(ε;φ), ε~r(.)
∇φL(θ,φ,x) = ∇φ ∫ qφ(z|x)(log(pθ(x,z)) - log(qφ(z|x)))dz = ∇φ ∫ r(ε)(log(pθ(x,g(ε;φ))) - log(qφ(g(ε;φ)|x)))dε = ∫ r(ε)∇φ(log(pθ(x,g(ε;φ))) - log(qφ(g(ε;φ)|x)))dε
transforms both variational and model distributions into distributions over independent random ‘noise’ variables followed by complex, parameterized, deterministic transformations
given a fixed assignment to the noise variables, derivatives can propagate from the final log probabilities back to the input parameters, leading to much more stable gradient estimates than with the Likelihood Ratio estimator
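The same hypothetical gradient d/dμ E_{z~N(μ,1)}[z²] = 2μ as above, now via the pathwise estimator, is stable with far fewer samples (Python/numpy):

    import numpy as np

    rng = np.random.default_rng(3)
    mu, n = 1.5, 1000
    eps = rng.normal(size=n)          # un-parameterized noise, eps ~ r(.) = N(0, 1)
    z = mu + eps                      # deterministic transformation z = g(eps; mu)
    grad_pathwise = np.mean(2 * z)    # d/dmu f(g(eps; mu)) = 2z * dz/dmu = 2z
    print(grad_pathwise, 2 * mu)      # ~3.0 from only 1000 samples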
variational autoencoder
introduction -
http://kvfrans.com/variational-autoencoders-explained/
http://jaan.io/what-is-variational-autoencoder-vae-tutorial/
http://hsaghir.github.io/denoising-vs-variational-autoencoder/
http://vdumoulin.github.io/morphing_faces/
tutorial - http://arxiv.org/abs/1606.05908 + https://github.com/cdoersch/vae_tutorial
papers - http://dustintran.com/blog/variational-auto-encoders-do-not-train-complex-generative-models/
Aaron Courville - http://videolectures.net/deeplearning2015_courville_autoencoder_extension/
Dmitry Vetrov - https://youtu.be/_qrHcSdQ2J4?t=1h37m21s (in russian)
Karol Gregor - https://dl.dropboxusercontent.com/u/16027344/ICML%202015%20Deep%20Learning%20Workshop/Karol%20Gregor%2C%20GOOGLE%20Deepmind.p2g/Default.html
Karol Gregor - http://youtube.com/watch?v=P78QYjWh5sM
Durk Kingma - http://youtube.com/watch?v=rjZL7aguLAs
"The difference between traditional variational methods and variational autoencoders is that in VAE, the local approximate posterior, q(zi|xi) is produced by a closed-form differentiable procedure (such as a neural network), as opposed to a local optimization. This allows the model and inference strategy to be joinly optimized."
"VAE framework aims to maximize the log-likelihood of the observed data x by introducing a set of stochastic latent variables z and marginalizing them out of the joint distribution p(x,z). While exact marginalization of the latent variables is generally intractable, the VAE introduces an approximate posterior q(z|x) and maximizes a variational lower bound on the log-likelihood of p(x). VAE allows powerful generative models to be trained efficiently by replacing slow iterative inference algorithms with fast feedforward approximate inference neural networks. The inference networks, which map observations to samples from the variational posterior, are trained jointly with the model by maximizing a common objective. This objective is a variational lower bound on the marginal log-likelihood."
"Let x be a random variable (real or binary) representing the observed data and z a collection of real-valued latent variables. The generative model over the pair (x,z) is given by p(x,z) = p(x|z)p(z), where p(z) is the prior distribution over the latent variables and p(x|z) is the conditional likelihood function. Generally, we assume that the components of z are independent Bernoulli or Gaussian random variables. The likelihood function is parameterized by a deep neural network pθ(x|z) = N(x|mu_p(z),sigma_p(z)) referred to as the decoder. A key aspect of VAEs is the use of a learned approximate inference procedure that is trained purely using gradient-based methods. This is achieved by using a learned approximate posterior qφ(z|x) = N(z|mu_q(x),sigma_q(x)) whose parameters are given by another deep neural network referred to as the encoder. Thus, we have z∼ Enc(x) = q(z|x) and x∼ Dec(z) = p(x|z). The parameters of these networks are optimized by minimizing the upper-bound on the expected negative log-likelihood of x, which is given by Eq(z|x)[-log pθ(x|z)]+KL(q(z|x)||p(z)). The first term corresponds to the reconstruction error, and the second term is a regularizer that ensures that the approximate posterior stays close to the prior."
encoder learns to approximate pθ(z|x) by maximizing a variational lower bound on the data log-likelihood: log pθ(x) ≥ L(θ,φ,x)
log pθ(x) = log ∫ pθ(x,z)dz = log ∫ (qφ(z|x)/qφ(z|x))pθ(x,z)dz ≥ <Jensen's inequality> [L(θ,φ,x):=] ∫ qφ(z|x)log(pθ(x,z)/qφ(z|x))dz = ∫ qφ(z|x)(log pθ(x,z) - log qφ(z|x))dz = ∫ qφ(z|x)((log pθ(x|z) + log pθ(z)) - log qφ(z|x))dz = ∫ qφ(z|x)(log pθ(x|z) - log(qφ(z|x)/pθ(z)))dz = ∫ qφ(z|x)log pθ(x|z)dz <reconstruction term> - Dkl(qφ(z|x)||pθ(z)) <regularization term>
variational lower bound:
log pθ(x) ≥ L(θ,φ,x) = Lz(θ,φ,x) + Lx(θ,φ,x), where
Lz(θ,φ,x) := - Dkl(qφ(z|x)||pθ(z)) (regularization term, keeps the approximate posterior qφ(z|x) close to the prior pθ(z))
Lx(θ,φ,x) := ∫ qφ(z|x)log pθ(x|z)dz (reconstruction term, expected log-likelihood of x under the decoder pθ(x|z) with respect to codes from the encoder qφ(z|x))
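A minimal sketch of this objective in code (an illustrative PyTorch snippet, assuming a Gaussian encoder with diagonal covariance, a Bernoulli decoder and a standard normal prior; decoder, mu and log_var stand for hypothetical user-supplied networks and their outputs):

    import torch
    import torch.nn.functional as F

    def negative_elbo(x, decoder, mu, log_var):
        # reparametrization: z = mu + sigma * eps with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        # -Lx: negative reconstruction term, single-sample Monte Carlo estimate
        recon = F.binary_cross_entropy_with_logits(decoder(z), x, reduction='sum')
        # -Lz: KL(q(z|x) || N(0, I)), available in closed form for Gaussians
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl  # negative of L(θ,φ,x), minimized by gradient descent

To generate new data one samples z from the prior N(0, I) and decodes it, as the next quote describes.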
"First, we're talking about reconstruction process. In order to reconstruct the input x you need to obtain its latent representation z using encoder q(z|x). Since q(z|x) is a distribution, you sample z from that distribution. Now you can either take the mean of p(x|z) as your reconstruction, or, again, sample from this distribution. The difference shouldn't matter in low dimensional spaces since most of the mass of normal distribution is concentrated around the mean, and normal distribution has little probability mass on its tails (i.e. it's not heavy-tailed).
Then, there's also sampling process. Remember that VAE is a generative (unsupervised) model, so we'd like to sample unseen x's from the model. If we didn't see them, we can't compute corresponding q(z|x) to sample z from. This is where the prior p(z) comes in: during the learning we optimized both reconstruction error and "regularization" term KL(q(z|x)||p(z)), which kept our encoder close to the prior. Now in order to sample from the model we first sample z from p(z) (in the paper it's standard multivariate Gaussian N(0, I)), and then use that z in the decoder p(x|z)."
"In variational auto-encoder there are two forces acting on the sampling layer. One is the likelihood (i.e. loss from the decoder p(x|z)) walks decoder to be able to reconstruct individual examples as best as it can. This term tries to make z as unique as possible so that it can reconstruct the x as accurately as possible. Another one is regularization term which wants posterior distribution to be as close to prior on z as possible. This term tries to make output of q indepedent of x. Competition between these two terms is what makes learning the variance of the distribution work, if you take away the KL term then the variance of the encoder will collapse."
"Reconstruction term will walk decoder to be able to reconstruct individual examples as best as it can. This term tries to make z as unique as possible so that it can reconstruct the x as accurately as possible. Regularization term wants posterior distribution to be as close to prior on z as possible. This term tries to make output of q indepedent of x."
"The difference between VAE and conventional autoencoder is, given a probability distribution, VAE learns the best possible representation that is parametrized by defined distribution. Let's say we want to fit gaussian distribution to the data. Then, it is able to learn mean and standard deviation of the multiple gaussian functions (corresponding VAE latent units) with backpropagation with a simple parametrization trick. Eventually, you obtain multiple gaussians with different mean and std on the latent units of VAE and you can sample new instances out of these."
"Current best practice in variational inference performs optimization of ELBO using mini-batches and stochastic gradient descent, which is what allows variational inference to be scaled to problems with very large data sets. There are two problems that must be addressed to successfully use the variational approach: 1) efficient computation of the derivatives of the expected log-likelihood ∇φEqφ(z)[log pθ(x|z)], and 2) choosing the richest, computationally-feasible approximate posterior distribution q(·). The bulk of research in variational inference over the years has been on ways in which to compute ∇φEqφ(z)[log p(x|z)]. Whereas we would have previously resorted to local variational methods, in general we now always compute such expectations using Monte Carlo approximations (including the KL term in the bound, if it is not analytically known). This forms what has been aptly named doubly stochastic estimation, since we have one source of stochasticity from the minibatch and a second from the Monte Carlo approximation of the expectation."
stochastic backpropagation or stochastic gradient variational Bayes (SGVB):
reparametrization trick: E p(y|x) [g(y)] = ∫ g(f(x,ξ))ρ(ξ)dξ,
where ξ ~ ρ(·) is a fixed noise distribution and y = f(x,ξ) is a differentiable transformation (such as a location-scale transformation or cumulative distribution function)
"The reparameterization trick enables the optimization of large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low variance unbiased estimators of the gradients of the expected loss."
"The "trick" part of the reparameterization trick is that you make the randomness an input to your model instead of something that happens "inside" it, which means you never need to differentiate with respect to sampling (which you can't do). Since the randomness is an input the whole network is deterministic, and you can differentiate the whole thing as normal. In particular, consider the following two ways of writing the objective:
f(z) where z = gφ(eps, x) and eps ~ p(eps)
f(z) where z ~ pφ(x)
In the first version you can compute the gradient of f with respect to phi, because the sampling has been "moved out of the way", but in the second version the sampling step "blocks" the gradient from z to phi."
location-scale transformation: z∼ N(z|µ,σ^2) <=> z = µ+σε, ε∼ N(0,1)
backpropagation with Monte Carlo: ∇φ(E qφ(z) [fθ(z)]) <=> E N(ε|0,1) [∇φfθ(µ + σε)]
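A toy sketch of this estimator (an illustrative PyTorch snippet; f is a stand-in for any differentiable fθ): gradients with respect to µ and σ flow through the deterministic transform z = µ + σε rather than through the sampling step.

    import torch

    mu = torch.tensor(0.5, requires_grad=True)
    log_sigma = torch.tensor(0.0, requires_grad=True)

    def f(z):                            # stand-in for fθ(z), e.g. a decoder loss
        return (z - 2.0) ** 2

    eps = torch.randn(1000)              # fixed noise distribution: ε ~ N(0, 1)
    z = mu + torch.exp(log_sigma) * eps  # differentiable location-scale transform
    f(z).mean().backward()               # Monte Carlo estimate of ∇φ E qφ(z) [fθ(z)]
    print(mu.grad, log_sigma.grad)       # unbiased gradient estimates for µ and σ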
"A number of general purpose approaches based on Monte Carlo control variate estimators exist as an alternative to stochastic backpropagation, and allow for gradient computation with latent variables that may be continuous or discrete. An important advantage of stochastic backpropagation is that, for models with continuous latent variables, it has the lowest variance among competing estimators."
http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
[sampling]
introduction - http://johndcook.com/blog/2016/01/23/introduction-to-mcmc/
introduction by Nando de Freitas - https://youtube.com/watch?v=TNZk8lo4e-Q + https://youtube.com/watch?v=sK3cg15g8FI
introduction by Igor Kuralenok (in russian) - http://youtube.com/watch?v=4qfTUF9LudY
introduction by Alex Smola - https://youtube.com/watch?v=M6aoDSsq2ig
introduction by Bob Carpenter - https://youtu.be/qQFF4tPgeWI?t=1h55m39s
"Monte Carlo Inference Methods" tutorial by Iain Murray - http://research.microsoft.com/apps/video/default.aspx?id=259575
"Markov Chain Monte Carlo Without all the Bullshit" by Jeremy Kun - http://jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/
"The Markov Chain Monte Carlo Revolution" by Persi Diaconis - http://math.uchicago.edu/~shmuel/Network-course-readings/MCMCRev.pdf
"Monte Carlo Methods, Stochastic Optimization" course by Kaynig-Fittkau and Protopapas - http://am207.org
"History of Monte Carlo Methods" by Sebastian Nowozin -
http://nowozin.net/sebastian/blog/history-of-monte-carlo-methods-part-1.html
http://nowozin.net/sebastian/blog/history-of-monte-carlo-methods-part-2.html
http://nowozin.net/sebastian/blog/history-of-monte-carlo-methods-part-3.html
- want an accurate representation of the posterior distribution
- sample from the posterior distribution rather than maximizing it as in EM
- sample subset of variables while keeping the rest fixed, iterate until converged, draw several samples
problem: direct sampling is usually intractable
solutions:
* Markov Chain Monte Carlo (complicated)
* Gibbs Sampling (somewhat simpler): draw one group at a time and iterate
Monte Carlo methods are a diverse class of algorithms that rely on repeated random sampling to compute the solution to problems whose solution space is too large to explore systematically or whose systemic behavior is too complex to model.
In Bayesian hierarchical models the numerator of the distribution of interest is often easy to calculate, but the normalizing constant/denominator is intractable since it is a sum/integral over the entire space. In these cases something like Metropolis-Hastings works because you only need the ratio of two evaluations to get the accept/reject probability, so the normalizing constant factors out (see the sketch below). A Gibbs sampler works because you reduce the problem of sampling from the whole distribution to iteratively sampling from simpler conditional distributions for which efficient sampling routines exist.
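A minimal Metropolis-Hastings sketch (illustrative numpy code with a toy target; only an unnormalized density p_tilde is required, since the normalizing constant cancels in the acceptance ratio):

    import numpy as np

    def metropolis_hastings(p_tilde, x0, n_samples, step=0.5, seed=0):
        rng = np.random.default_rng(seed)
        samples, x = [], x0
        for _ in range(n_samples):
            x_prop = x + step * rng.standard_normal()  # symmetric random-walk proposal
            if rng.random() < min(1.0, p_tilde(x_prop) / p_tilde(x)):
                x = x_prop                             # accept, otherwise keep x
            samples.append(x)
        return np.array(samples)

    # example: sample from the unnormalized Gaussian exp(-x^2/2)
    draws = metropolis_hastings(lambda x: np.exp(-0.5 * x**2), x0=0.0, n_samples=10000)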
In importance sampling you essentially bias the proposal so that you draw more samples from low-probability but important regions. The only problem is that you have to know where to add the bias, so you first have to know something about the probability distribution. After you draw your samples, you re-weight them to remove the effect of the bias.
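A small importance-sampling sketch (illustrative numpy code; the toy task is estimating the tail probability P(X > 4) for X ~ N(0,1) by biasing the proposal towards the tail and re-weighting):

    import numpy as np

    rng = np.random.default_rng(0)

    def phi(x, mu=0.0):                      # normal density with unit variance
        return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

    x = rng.normal(4.0, 1.0, 100000)         # biased proposal centered on the tail
    weights = phi(x) / phi(x, mu=4.0)        # re-weight to remove the bias
    estimate = np.mean((x > 4.0) * weights)  # ≈ 3.2e-5; naive sampling would see
                                             # only a handful of hits per 100k draws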
Gibbs sampling starts from a random possible world and iterates over each variable v, computing a new value for it according to a probability computed by taking into account the factor functions of the factors that v is connected to and the values of the variables connected to such factors (this is known as the Markov blanket of v), then the process moves to a different variable and iterates. After enough iterations over the random variables, we can compute the number of iterations during which each variable had a specific value and use the ratio between this quantity and the total number of iterations as an estimate of the probability of the variable taking that value.
Allows one to estimate joint distributions, and not only marginal ones, in contrast to message-passing methods: expectation propagation (with belief propagation as a particular case) and variational belief propagation.
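A minimal Gibbs sampler sketch for the process described above (illustrative numpy code; the toy target is a bivariate Gaussian with correlation rho, whose full conditionals are univariate Gaussians that are easy to sample):

    import numpy as np

    def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
        rng = np.random.default_rng(seed)
        samples = np.empty((n_samples, 2))
        x1 = x2 = 0.0
        sd = np.sqrt(1.0 - rho ** 2)        # conditional standard deviation
        for i in range(n_samples):
            x1 = rng.normal(rho * x2, sd)   # draw x1 ~ p(x1 | x2)
            x2 = rng.normal(rho * x1, sd)   # draw x2 ~ p(x2 | x1)
            samples[i] = (x1, x2)
        return samples

    draws = gibbs_bivariate_gaussian(rho=0.8, n_samples=10000)  # estimate the joint from draws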
[likelihood-free inference]
Approximate Bayesian Computation
http://dennisprangle.github.io/research/2016/06/07/bayesian-inference-by-neural-networks
http://dennisprangle.github.io/research/2016/06/07/bayesian-inference-by-neural-networks2
"Machine Learning and Likelihood-Free Inference in Particle Physics" by Kyle Cranmer - https://channel9.msdn.com/Events/Neural-Information-Processing-Systems-Conference/Neural-Information-Processing-Systems-Conference-NIPS-2016/Machine-Learning-and-Likelihood-Free-Inference-in-Particle-Physics
Some statistical models are specified via a data generating process for which the likelihood function cannot be computed in closed form. Standard likelihood-based inference is then not feasible but the model parameters can be inferred by finding the values which yield simulated data that resemble the observed data.
Classical approaches to statistical inference are based on the probability (or probability density) of the observed data y0 given particular parameter values θ. This is known as the likelihood function, π(y0|θ). Since y0 is fixed this is a function of θ and so can be written L(θ). Approaches to inference involve optimising this, used in maximum likelihood methods, or exploring it, for Bayesian methods, which are described in more detail shortly. A crucial implicit assumption of both approaches is that it’s possible and computationally inexpensive to numerically evaluate the likelihood function.
As computing power has increased over the last few decades, there are an increasing number of interesting situations for which this assumption doesn’t hold. Instead models are available from which data can be simulated, but where the likelihood function is intractable, in that it cannot be numerically evaluated in a practical time. Examples include models of: climate, high energy physics reactions, variation in genetic sequences over a population, molecular level biological reactions, infectious diseases. One common reason for intractability is that there are a very large number of ways in which the observable data can be generated, and it would be necessary to sum the probability contributions of all of these.
Several methods have been proposed for inference using simulators rather than the likelihood function, sometimes called “likelihood-free inference”. One of the most popular is approximate Bayesian computation. The simplest version of this is based on rejection sampling:
- Sample θi values for 1≤i≤n from π(θ).
- Simulate datasets yi from the model given parameters θi for 1≤i≤n.
- Accept parameters for which d(yi,y0)≤ϵ, and reject the remainder.
This approach faces at least two major difficulties: The first difficulty is the choice of the discrepancy measure which is used to judge whether the simulated data resemble the observed data. The second difficulty is the computationally efficient identification of regions in the parameter space where the discrepancy is low.
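A toy ABC rejection sampler following the three steps above (illustrative numpy code; the sample mean as summary statistic and the absolute difference of means as discrepancy d are arbitrary choices for this example):

    import numpy as np

    rng = np.random.default_rng(0)
    y0 = rng.normal(3.0, 1.0, size=100)           # "observed" data, true theta = 3

    def abc_rejection(n, eps):
        accepted = []
        for _ in range(n):
            theta = rng.uniform(-10.0, 10.0)      # 1. sample theta_i from the prior
            y = rng.normal(theta, 1.0, size=100)  # 2. simulate a dataset y_i
            if abs(y.mean() - y0.mean()) <= eps:  # 3. accept if d(y_i, y0) <= eps
                accepted.append(theta)
        return np.array(accepted)

    posterior_draws = abc_rejection(n=100000, eps=0.1)  # approximate posterior over theta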
[causal inference]
The goal of causal inference is to understand the outcome of alternative courses of action.
introduction by Adam Kelleher
https://medium.com/@akelleh/causal-data-science-721ed63a4027
https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41
https://medium.com/@akelleh/if-correlation-doesnt-imply-causation-then-what-does-c74f20d26438
https://medium.com/@akelleh/understanding-bias-a-pre-requisite-for-trustworthy-results-ee590b75b1be
https://medium.com/@akelleh/speed-vs-accuracy-when-is-correlation-enough-when-do-you-need-causation-708c8ca93753
tutorial by David Sontag and Uri Shalit at ICML 2016 - http://techtalks.tv/talks/causal-inference-for-observational-studies/62355/
tutorial by Jonas Peters at MLSS Cadiz 2016 - https://youtube.com/watch?v=_wFagI5Fn9I + https://youtube.com/watch?v=5cjmlcmhisw
overview by Judea Pearl - https://www.edge.org/conversation/judea_pearl-engines-of-evidence
"The Art and Science of Cause and Effect" by Judea Pearl - http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf
(Judea Pearl) "What is more likely, that a daughter will have blue eyes given that her mother has blue eyes or the other way around—that the mother will have blue eyes given that the daughter has blue eyes? Most people will say the former—they'll prefer the causal direction. But it turns out the two probabilities are the same, because the number of blue-eyed people in every generation remains stable. I took it as evidence that people think causally, not probabilistically—they're biased by having easy access to causal explanations, even though probability theory tells you something different.
There are many biases in our judgment that are created by our inclination to attribute causal relationships where they do not belong. We see the world as a collection of causal relationships and not as a collection of statistical or associative relationships. Most of the time, we can get by, because they are closely tied together. Once in a while we fail. The blue-eye story is an example of such failure.
The slogan, "Correlation doesn't imply causation" leads to many paradoxes. For instance, the size of a child's thumb is highly correlated with their reading ability. So, naively, if you want to be taller, you should learn to read better. This kind of paradoxical example convinces us that correlation does not imply causation. Still, people fall into that trap quite often because they crave causal explanations. The mind is a causal processor, not an association processor. Once you acknowledge that, the question remains how we reconcile the discrepancies between the two. How do we organize causal relationships in our mind? How do we operate on and update such a mental representation?"
(Dustin Tran) "To me, causal inference is one of the most interesting fields in statistics and machine learning, and with the greatest potential for long term impact. It can significantly speed up progress towards something like artificial general intelligence (and is arguably necessary to achieve it). And most immediately, it enables richer data analyses to capture scientific phenomena. In order for our models to truly infer generative processes, they must understand and learn causal notions of the world.
Much of the work in the causal inference community has focused on nonparametric models, which make few modeling assumptions. They satisfy theoretic notions such as asymptotics and can perform well on small-to-medium size data sets (a typical setting in applied causal inference). However, in higher-dimensional and massive data settings, we require more complex generative models, as we’ve seen in probabilistic machine learning."
[interesting quotes]
(David Barber) "For me Bayesian Reasoning is probability theory extended to treating parameters and models as variables. In this sense, for me the question is essentially the same as `what makes probabiltiy appealing?' Probability is a (some people would say *the*) logical calculus of uncertainty. There are many aspects of machine learning in which we naturally need to deal with uncertainty. I like the probability approach since it naturally enables one to integrate prior knowledge about a problem into the solution. It does this also in a way that requires one to be explicit about the assumptions being made about the model. People have to be clear about their model specification some people might not agree with that model, but at least they know what the assumptions of the model are."
(Alex Lamb) "Why we care about probabilistic models? The first question is complicated and still hotly debated. I suppose the main advantage to a probabilistic model is that problems have uncertainty, and probability provides a well-defined way of quantifying that uncertainty. You can rely on all of the existing research on sampling from distributions, doing inference over distributions, conditioning on distributions, and so on."
(Zoubin Ghahramani) "The key ingredient of Bayesian methods is not the prior, it's the idea of averaging over different possibilities."
() "The frequentist vs. Bayesian debate that raged for decades in statistics before sputtering out in the 90s had more of a philosophical flavor. Starting with Fisher, frequentists argued that unless a priori probabilities were known exactly, they should not be "guessed" or "intuited", and they created many tools that did not require the specification of a prior. Starting with Laplace, Bayesians quantified lack of information by means of a "uninformative" or "objective" uniform prior, using Bayes theorem to update their information as more data came in. Once it became clear that this uniform prior was not invariant under transformation, Bayesian methods fell out of mainstream use. Jeffreys led a Bayesian renaissance with his invariant prior, and Lindley and Savage poked holes in frequentist theory. Statisticians realized that things weren't quite so black and white, and the rise of MCMC methods and computational statistics made Bayesian inference feasible in many, many new domains of science. Nowadays, few statisticians balk at priors, and the two strands have effectively merged (consider the popularity of empirical Bayes methods, which combine the best of both schools). There are still some Bayesians that consider Bayes theorem the be-all-end-all approach to inference, and will criticize model selection and posterior predictive checks on philosophical grounds. However, the vast majority of statisticians will use whatever method is appropriate. The problem is that many scientists aren't yet aware of/trained in Bayesian methods and will use null hypothesis testing and p-values as if they're still the gold standard in statistics."
() "Bayesian methods have a nice intuitive flow to them. You have a belief (formulated into a prior), you observe data and evaluate it in the context of a likelihood function that you think fits the data generation process well, you have a new updated belief. Nice, elegant, intuitive. I thought this, I saw that, now I think this. Compared to like a maximum likelihood method that will answer the question of what parameters with this likelihood function best fit my data. Which doesn't really answer your actual research question. If I flip a coin one time and get heads, and do a maximum likelihood approach, then it's going to tell me that the type of coin most likely to have given me that result is a double-headed coin. That's probably not the question you had, you probably wanted to know "what's the probability that this comes up heads?" not "what type of coin would give me this result with the highest probability?"."
() "Bayesian modelling is more elegant, but requires more story telling, which is bad. For instance the recent paper about bayesian program induction requires an entire multilevel story about how strokes are created and how they interact. Just flipping a coin requires a story about a mean and prior distribution over the mean and the hyperparameters describing the prior. It's great but I am a simple man and I just want input output. The other criticism is bayesian cares little for actual computational resources. I just want a simple neural net that runs in linear/polytime, has a simple input-output interpretation, no stories required, to heck if its operation is statistically theoretically unjustified or really even outside of the purview of human understanding to begin with, as long as it vaguely seems to do cool stuff."
() "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
() "The choice is between A) finding a point estimate of parameters that minimizes some ad hoc cost function that balances the true cost and some other cost designed to reduce overfitting, and Bayes) integrating over a range of models with respect to how well they fit the data. Optimization isn't fundamentally what modeling data is about. Optimization is what you do when you can't integrate. Unfortunately you're left with hyperparameters to tune and you often fall back on weak forms of integration: cross validation and model averaging."
() "The rule-based system, scientifically speaking, was on the wrong track. They modeled the experts instead of modeling the disease. The problems were that the rules created by the programmers did not combine properly. When you added more rules, you had to undo the old ones. It was a very brittle system. A new thinking came about in the early '80s when we changed from rule-based systems to a Bayesian network. Bayesian networks are probabilistic reasoning systems. An expert will put in his or her perception of the domain. A domain can be a disease, or an oil field—the same target that we had for expert systems. The idea was to model the domain rather than the procedures that were applied to it. In other words, you would put in local chunks of probabilistic knowledge about a disease and its various manifestations and, if you observe some evidence, the computer will take those chunks, activate them when needed and compute for you the revised probabilities warranted by the new evidence. It's an engine for evidence. It is fed a probabilistic description of the domain and, when new evidence arrives, the system just shuffles things around and gives you your revised belief in all the propositions, revised to reflect the new evidence."
() "If you are able to create a successful generative model, than you now understand more about the underlying science of the problem. You're not just able to fit data well, but you have a model for how the process that generates the data works. If you're just trying to build the best classifier you can with the resources you have, this might not be that useful, but if you're interested in the science of the system that generated this data, this is crucial. What is also crucial is that you can often sacrifice some accuracy to simplify a lot your generative model and obtain really simple mechanisms that tell you a lot about the basic science of the system. Of course it's difficult to see how this transfers to problems like generative deep models for computer vision, where your model is a huge neural network that is not as transparent to read as a simple bayesian model. But I think part of the goal is this: hey, look at this particular filter the network learned - it can generate X or Y objects when I turn it on and off. Now we understand a little more of how we perceive objects X and Y. There's also a feeling that generative models will eventually be more accurate if you can find the "true" generative process that created the data and that nothing could be more accurate than this (after all, this is the true process)."
(John D. Cook) "The primary way to quantify uncertainty is to use probability. Subject to certain axioms that aim to capture common-sense rules for quantifying uncertainty, probability theory is essentially the only way. (This is Cox’s theorem.) Other methods, such as fuzzy logic, may be useful, though they must violate common sense (at least as defined by Cox’s theorem) under some circumstances. They may be still useful when they provide approximately the results that probability would have provided and at less effort and stay away from edge cases that deviate too far from common sense. There are various kinds of uncertainty, principally epistemic uncertainty (lack of knowledge) and aleatory uncertainty (randomness), and various philosophies for how to apply probability. One advantage to the Bayesian approach is that it handles epistemic and aleatory uncertainty in a unified way."
(Abram Demski) "A Bayesian learning system has a space of possible models of the world, each with a specific weight, the prior probability. The system can converge to the correct model given enough evidence: as observations come in, the weights of different theories get adjusted, so that the theory which is predicting observations best gets the highest scores. These scores don't rise too fast, though, because there will always be very complex models that predict the data perfectly; simpler models have higher prior weight, and we want to find models with a good balance of simplicity and predictive accuracy to have the best chance of correctly predicting the future."
(Yann LeCun) "I think if it were true that P=NP or if we had no limitations on memory and computation, AI would be a piece of cake. We could just brute-force any problem. We could go "full Bayesian" on everything (no need for learning anymore. Everything becomes Bayesian marginalization). But the world is what it is."
() "Imagine if back in Newton's day, they were analyzing data from physical random variables with deep nets. Sure, they might get great prediction accuracy on how far a ball will go given measurements of its weight, initial force/angle, and some other irrelevant variables, but would this really be the best approach to discover all of the useful laws of physics such as f = ma and the conversion from potential to kinetic energy via the gravitational constant? Probably not, in fact the predictions might be in some sense "too good" incorporating other confounding effects such as air drag and the shape / spin of the ball which obfuscate the desired law. In many settings where an interpretation of what is going on in the data is desired, a clear model is necessary with simple knobs that have clear effects when turned. This may also be a requirement not only for human interpretation, but an also AI system which is able to learn and combine facts about the world (rather than only storing the complex functions which represent the relationships between things as inferred by a deep-net)."
(Daphne Koller) "Uncertainty is unavoidable in real-world applications: we can almost never predict with certainty what will happen in the future, and even in the present and the past, many important aspects of the world are not observed with certainty. Probability theory gives us the basic foundation to model our beliefs about the different possible states of the world, and to update these beliefs as new evidence is obtained. These beliefs can be combined with individual preferences to help guide our actions, and even in selecting which observations to make. While probability theory has existed since the 17th century, our ability to use it effectively on large problems involving many inter-related variables is fairly recent, and is due largely to the development of a framework known as Probabilistic Graphical Models. This framework, which spans methods such as Bayesian networks and Markov random fields, uses ideas from discrete data structures in computer science to efficiently encode and manipulate probability distributions over high-dimensional spaces, often involving hundreds or even many thousands of variables."
(Michael Jordan) "Probabilistic graphical models are one way to express structural aspects of joint probability distributions, specifically in terms of conditional independence relationships and other factorizations. That's a useful way to capture some kinds of structure, but there are lots of other structural aspects of joint probability distributions that one might want to capture, and PGMs are not necessarily going to be helpful in general. There is not ever going to be one general tool that is dominant; each tool has its domain in which its appropriate. On the other hand, despite having limitations (a good thing!), there is still lots to explore in PGM land. Note that many of the most widely-used graphical models are chains - the HMM is an example, as is the CRF. But beyond chains there are trees and there is still much to do with trees. There's no reason that one can't allow the nodes in graphical models to represent random sets, or random combinatorial general structures, or general stochastic processes; factorizations can be just as useful in such settings as they are in the classical settings of random vectors. There's still lots to explore there."
(Ferenc Huszar) "My favourite theoretical machine learning papers are ones that interpret heuristic learning algorithms in a probabilistic framework, and uncover that they in fact are doing something profound and meaningful. Being trained as a Bayesian, what I mean by profound typically means statistical inference or fitting statistical models. An example would be the k-means algorithm. K-means intuitively makes sense as an algorithm for clustering. But we only really understand what it does when we make the observation that it actually is a special case of expectation-maximisation in gaussian mixture models. This interpretation as special case of something allows us to understand the expected behaviour of the algorithm better. It will allow us to make predictions about the situations in which it's likely to fail, and to meaningfully extend it to situations it doesn't handle well."
(Ferenc Huszar) "There is no such thing as learning without priors. In the simplest form, the objective function of the optimisation is a prior - you tell the machine that it's goal is to minimise mean squared error for example. The machine solves the optimisation problem (typically) you tell it to solve, and good machine learning is about figuring out what that problem is. Priors are part of that. Secondly, if you think about it, it is actually a tiny portion of machine learning problems where you actually have enough data to get away without engineering better priors or architectures by just using a model which is highly flexible. Today, you can do this in visual, audio, video domain because you can collect and learn from tonnes of examples and particularly because you can use unsupervised or semi-supervised learning to learn natural invariances. An example is chemistry: if you want to predict certain properties of chemicals, it almost doesn't make sense to use data only to make the machine learn what a chemical is, and what the invariances are - doing that would be less accurate and a lot harder than giving it the required context. Un- and semi-supervised learning doesn't make sense because in many cases learning about the natural distribution of chemicals (even if you had a large dataset of this) may be uninformative of the prediction tasks you want to solve."
(Ferenc Huszar) "My belief is that speeding up computation is not fast enough, you do need priors to beat the curse of dimensionality. Think rotational invariance. Yes, you can model that by allowing enough flexibility in a neural netowrk to learn separate representations for all possible rotations of an object, but you're exponentially more efficient if you can somehow 'integrate out' the invariance by designing the architecture/maths cleverly. By modeling invariances correctly, you can make exponential leaps in representational capacity of the network - on top of the exponential growth in computing power that'd kind of a given. I don't think the growth in computing power is fast enough to make progress in machine learning for real-world hard tasks. You need that, combined with exponential leaps on top of that, made possible by building in prior knowledge correcltly."
() "Many labelling problems are probably better solved by (conditional) generative models. Multi-label problems where the labels are not independent are an obvious example. Even in the single label case, I bet it's probably better to represent uncertainty in the appropriate label via multiple modes in the internal behavior of the model, rather than relegating all entropy to the final prediction."
(Yann LeCun) "I'm a big fan of the conceptual framework of factor graphs as a way to describe learning and inference models. But I think in their simplest/classical form (variable nodes, and factor nodes), factor graphs are insufficient to capture the computational issues. There are factors, say between two variables X and Y, that allow you to easily compute Y from X, but not X from Y, or vice versa. Imagine that X is an image, Y a description of the image, and the factor contains a giant convolutional net that computes a description of the image and measures how well Y matches the computed description. It's easy to infer Y from X, but essentially impossible to infer X from Y. In the real world, where variables are high dimensional and continuous, and where dependencies are complicated, factors are directional."
() "It's interesting that many summarize Bayesian methods as being about priors; but real power is its focus on integrals and expectations over maximas and modes."
() "Broadly speaking there are two ways of doing inference in ML. One is integration (which tends to be Bayesian) and the other is optimization (which is usually not). A lot of things that "aren't" Bayesian turn out to be the same algorithm with a different interpretation when you view them from a Bayesian perspective (like ridge regression being a MAP estimate for linear regression with a Gaussian prior). However, there are plenty of things people do that don't fit easily into a Bayesian framework. A few of them that come to mind are random forests, energy based models (in the Yann LeCun sense), and the Apriori and DBSCAN algorithms."
(Yann LeCun) "There is no opposition between "deep" and "Bayesian". Many deep learning methods are Bayesian, and many more can be made Bayesian if you find that useful. David Mackay, myself and a few colleagues at Bell Labs have worked in the 90s on variational Bayesian methods for getting probabilities out of the neural nets (by integrating over a Gaussian approximation of the weight posterior), RBMs are Bayesian, Variational Auto-Encoders are Bayesian, the view of neural nets as factor graphs is Bayesian."
(Nando de Freitas) "Some folks use information theory to learn autoencoders - it's not clear what the value of the prior is in this setting. Some are using Bayesian ideas to obtain confidence intervals - but the bootstrap could have been equally used. Where it becomes interesting is where people use ideas of deep learning to do Bayesian inference. An example of this is Kevin Murphy and colleagues using distillation (aka dark knowledge) for reducing the cost of Bayesian model averaging. I also think deep nets have enough power to implement Bayes rule and sampling rules. I strongly believe that Bayesian updating, Bayesian filtering and other forms of computation can be approximated by the type of networks we use these days. A new way of thinking is in the air."
(Ian Osband) "In sequential decision problems there is an important distinction between risk and uncertainty. We identify risk as inherent stochasticity in a model and uncertainty as the confusion over which model parameters apply. For example, a coin may have a fixed p = 0.5 of heads and so the outcome of any single flip holds some risk; a learning agent may also be uncertain of p. The demarcation between risk and uncertainty is tied to the specific model class, in this case a Bernoulli random variable; with a more detailed model of flip dynamics even the outcome of a coin may not be risky at all. Our distinction is that unlike risk, uncertainty captures the variability of an agent’s posterior belief which can be resolved through statistical analysis of the appropriate data. For a learning agent looking to maximize cumulative utility through time, this distinction represents a crucial dichotomy. Consider the reinforcement learning problem of an agent interacting with its environment while trying to maximize cumulative utility through time. At each timestep, the agent faces a fundamental tradeoff: by exploring uncertain states and actions the agent can learn to improve its future performance, but it may attain better short-run performance by exploiting its existing knowledge. At a high level this effect means uncertain states are more attractive since they can provide important information to the agent going forward. On the other hand, states and action with high risk are actually less attractive for an agent in both exploration and exploitation. For exploitation, any concave utility will naturally penalize risk. For exploration, risk also makes any single observation less informative. Although colloquially similar, risk and uncertainty can require radically different treatment."
() "While Bayesian inference can capture uncertainty about parameters, it relies on the model being correctly specified. However, in practice, all models are wrong. And in fact, this model mismatch can be often be large enough that we should be more concerned with calibrating our inferences to correct for the mismatch than to produce uncertainty estimates from incorrect assumptions."
() "While in principle it is nice that we can build models separate from our choice of inference, we often need to combine the two in practice. (The whole naming behind the popular model-inference classes of “variational auto-encoders” and “generative adversarial networks” are one example.) That is, we often choose our model based on what we know enables fast inferences, or we select hyperparameters in our model from data. This goes against the Bayesian paradigm."
selected papers and books - https://dropbox.com/sh/txqk44kqgn1f9t4/AAD3DCEDXjPxGPFOa0SJYJwga
interesting recent papers:
- variational autoencoders - https://github.com/brylevkirill/notes/blob/master/interesting%20recent%20papers.md#generative-models---variational-autoencoders
- unsupervised learning - https://github.com/brylevkirill/notes/blob/master/interesting%20recent%20papers.md#unsupervised-learning
- bayesian inference and learning - https://github.com/brylevkirill/notes/blob/master/interesting%20recent%20papers.md#bayesian-inference-and-learning
interesting papers:
- bayesian deep learning - https://github.com/brylevkirill/notes/blob/master/Deep%20Learning.md#bayesian-deep-learning
- variational autoencoder - https://github.com/brylevkirill/notes/blob/master/Deep%20Learning.md#generative-models---variational-autoencoder
- inference - https://github.com/brylevkirill/notes/blob/master/Probabilistic%20Programming.md#interesting-papers---inference
- applications - https://github.com/brylevkirill/notes/blob/master/Probabilistic%20Programming.md#interesting-papers---applications
[interesting papers]
Eisner - "Inside-Outside and Forward-Backward Algorithms Are Just Backprop" [https://www.cs.jhu.edu/~jason/papers/eisner.spnlp16.pdf]
"A probabilistic or weighted grammar implies a posterior probability distribution over possible parses of a given input sentence. One often needs to extract information from this distribution, by computing the expected counts (in the unknown parse) of various grammar rules, constituents, transitions, or states. This requires an algorithm such as inside-outside or forward-backward that is tailored to the grammar formalism. Conveniently, each such algorithm can be obtained by automatically differentiating an “inside” algorithm that merely computes the log-probability of the evidence (the sentence). This mechanical procedure produces correct and efficient code. As for any other instance of back-propagation, it can be carried out manually or by software. This pedagogical paper carefully spells out the construction and relates it to traditional and nontraditional views of these algorithms."
Diaconis - "The Markov Chain Monte Carlo Revolution" [http://math.uchicago.edu/~shmuel/Network-course-readings/MCMCRev.pdf]
"The use of simulation for high dimensional intractable computations has revolutionized applied mathematics. Designing, improving and understanding the new tools leads to (and leans on) fascinating mathematics, from representation theory through micro-local analysis."
Salimans, Kingma, Welling - "Markov Chain Monte Carlo and Variational Inference: Bridging the Gap" [http://jmlr.org/proceedings/papers/v37/salimans15.pdf]
"Recent advances in stochastic gradient variational inference have made it possible to perform variational Bayesian inference with posterior approximations containing auxiliary random variables. This enables us to explore a new synthesis of variational inference and Monte Carlo methods where we incorporate one or more steps of MCMC into our variational approximation. By doing so we obtain a rich class of inference algorithms bridging the gap between variational methods and MCMC, and offering the best of both worlds: fast posterior approximation through the maximization of an explicit objective, with the option of trading off additional computation for additional accuracy. We describe the theoretical foundations that make this possible and show some promising first results."
Blei, Kucukelbir, McAuliffe - "Variational Inference: A Review for Statisticians" [http://arxiv.org/abs/1601.00670]
"One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms."
Kingma, Welling - "Auto-Encoding Variational Bayes" [http://arxiv.org/abs/1312.6114]
"How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results."
--
"Latent variable probabilistic models are ubiquitous, but often inference in such models is intractable. Variational inference methods based on approximation of the true posterior currently are most popular deterministic inference techniques. Recently one particularly interesting method for parametric variational approximation was proposed called Auto-encoding variational bayes. In this method, approximate posterior explicitly depends on data and may be almost arbitrary complex, e.g. a deep neural network. Thus, the problem of variational inference may be considered as a learning of auto-encoder where the code is represented by latent variables, encoder is the likelihood model and decoder is our variational approximation. Since neural networks can serve as universal function approximators, such inference method may allow to obtain better results than for "shallow" parametric approximations or free-form mean-field ones."
-- http://youtube.com/watch?v=rjZL7aguLAs (Kingma)
-- http://videolectures.net/deeplearning2015_courville_autoencoder_extension/ (Courville)
-- https://dl.dropboxusercontent.com/u/16027344/ICML%202015%20Deep%20Learning%20Workshop/Karol%20Gregor%2C%20GOOGLE%20Deepmind.p2g/Default.html (Gregor)
-- https://youtu.be/_qrHcSdQ2J4?t=1h37m21s (Vetrov, in russian)
-- http://hsaghir.github.io/denoising-vs-variational-autoencoder/
-- http://arxiv.org/abs/1606.05908 + https://github.com/cdoersch/vae_tutorial (tutorial)
-- http://arxiv.org/abs/1610.09296 (explanation)
-- http://blog.fastforwardlabs.com/post/148842796218/introducing-variational-autoencoders-in-prose-and + http://blog.fastforwardlabs.com/post/149329060653/under-the-hood-of-the-variational-autoencoder-in
-- http://kvfrans.com/variational-autoencoders-explained/
-- http://jaan.io/what-is-variational-autoencoder-vae-tutorial/
-- http://blog.keras.io/building-autoencoders-in-keras.html
-- https://github.com/fchollet/keras/blob/master/examples/variational_autoencoder.py
-- https://jmetzen.github.io/2015-11-27/vae.html
-- https://github.com/casperkaae/parmesan
-- https://github.com/arahuja/generative-tf
-- https://github.com/blei-lab/edward/blob/master/examples/vae_convolutional.py
-- https://github.com/Kaixhin/Autoencoders/blob/master/models/VAE.lua
Rezende, Mohamed, Wierstra - "Stochastic Backpropagation and Approximate Inference in Deep Generative Models" [http://arxiv.org/abs/1401.4082]
"We marry ideas from deep neural networks and approximate Bayesian inference to derive a generalised class of deep, directed generative models, endowed with a new algorithm for scalable inference and learning. Our algorithm introduces a recognition model to represent an approximate posterior distribution and uses this for optimisation of a variational lower bound. We develop stochastic backpropagation rules for gradient backpropagation through stochastic variables and derive an algorithm that allows for joint optimisation of the parameters of both the generative and recognition models. We demonstrate on several real-world data sets that by using stochastic backpropagation and variational inference, we obtain models that are able to generate realistic samples of data, allow for accurate imputations of missing data, and provide a useful tool for high-dimensional data visualisation."
-- http://techtalks.tv/talks/stochastic-backpropagation-and-approximate-inference-in-deep-generative-models/60885/
-- https://dropbox.com/s/s1mgon5e7lf5svx/Stochastic%20Backpropagation%20and%20Approximate%20Variational%20Inference%20in%20Deep%20Generative%20Models%20%28slides%29.pdf (in russian)
Rezende, Mohamed - "Variational Inference with Normalizing Flows" [http://arxiv.org/abs/1505.05770]
"The choice of approximate posterior distribution is one of the core problems in variational inference. Most applications of variational inference employ simple families of posterior approximations in order to allow for efficient inference, focusing on mean-field or other simple structured approximations. This restriction has a significant impact on the quality of inferences made using variational methods. We introduce a new approach for specifying flexible, arbitrarily complex and scalable approximate posterior distributions. Our approximations are distributions constructed through a normalizing flow, whereby a simple initial density is transformed into a more complex one by applying a sequence of invertible transformations until a desired level of complexity is attained. We use this view of normalizing flows to develop categories of finite and infinitesimal flows and provide a unified view of approaches for constructing rich posterior approximations. We demonstrate that the theoretical advantages of having posteriors that better match the true posterior, combined with the scalability of amortized variational approaches, provides a clear improvement in performance and applicability of variational inference."
"We propose the specification of approximate posterior distributions using normalizing flows, a tool for constructing complex distributions by transforming a probability density through a series of invertible mappings. Inference with normalizing flows provides a tighter, modified variational lower bound with additional terms that only add terms with linear time complexity.
We show that normalizing flows admit infinitesimal flows that allow us to specify a class of posterior approximations that in the asymptotic regime is able to recover the true posterior distribution, overcoming one oft-quoted limitation of variational inference.
We present a unified view of related approaches for improved posterior approximation as the application of special types of normalizing flows.
We show experimentally that the use of general normalizing flows systematically outperforms other competing approaches for posterior approximation."
"In this work we developed a simple approach for learning highly non-Gaussian posterior densities by learning transformations of simple densities to more complex ones through a normalizing flow. When combined with an amortized approach for variational inference using inference networks and efficient Monte Carlo gradient estimation, we are able to show clear improvements over simple approximations on different problems. Using this view of normalizing flows, we are able to provide a unified perspective of other closely related methods for flexible posterior estimation that points to a wide spectrum of approaches for designing more powerful posterior approximations with different statistical and computational tradeoffs. An important conclusion from the discussion in section 3 is that there exist classes of normalizing flows that allow us to create extremely rich posterior approximations for variational inference. With normalizing flows, we are able to show that in the asymptotic regime, the space of solutions is rich enough to contain the true posterior distribution. If we combine this with the local convergence and consistency results for maximum likelihood parameter estimation in certain classes of latent variables models, we see that we are now able overcome the objections to using variational inference as a competitive and default approach for statistical inference. Making such statements rigorous is an important line of future research. Normalizing flows allow us to control the complexity of the posterior at run-time by simply increasing the flow length of the sequence. The approach we presented considered normalizing flows based on simple transformations of the form (10) and (14). These are just two of the many maps that can be used, and alternative transforms can be designed for posterior approximations that may require other constraints, e.g., a restricted support. An important avenue of future research lies in describing the classes of transformations that allow for different characteristics of the posterior and that still allow for efficient, linear-time computation."
Agrawal, Dukkipati - "Deep Variational Inference Without Pixel-Wise Reconstruction" [http://arxiv.org/abs/1611.05209]
"Variational autoencoders, that are built upon deep neural networks have emerged as popular generative models in computer vision. Most of the work towards improving variational autoencoders has focused mainly on making the approximations to the posterior flexible and accurate, leading to tremendous progress. However, there have been limited efforts to replace pixel-wise reconstruction, which have known shortcomings. In this work, we use real-valued non-volume preserving transformations (real NVP) to exactly compute the conditional likelihood of the data given the latent distribution. We show that a simple VAE with this form of reconstruction is competitive with complicated VAE structures, on image modeling tasks. As part of our model, we develop powerful conditional coupling layers that enable real NVP to learn with fewer intermediate layers."
"VAPNEV is competitive with convolutional DRAW which is a complicated VAE structure with multiple stochastic layers and recurrent connections. This establishes that replacing pixel-wise reconstruction with exact likelihood methods like real NVP is beneficial to the performance of VAEs. The model is also competitive with real NVP, which uses a much bigger architecture. This shows the power of the conditional coupling layer transform, which is able to effectively utilize the semantic representation learned by the VAE latent distribution."
"Unlike a regular VAE, a single z might lead to different samples in VAPNEV, because of stochasticity in the Y space."
"We develop powerful conditional coupling layer transforms which enable the model to learn with smaller architectures. VAPNEV provides a lot of advantages such as (i) it provides a way to replace pixel-wise reconstruction which has known shortcomings, (ii) it gives a generative model which can be trained and sampled from efficiently and (iii) it is a latent variable model which can be used for downstream supervised or semi-supervised learning. This work can be extended in several ways. Using deeper architectures, and combining with expressive posterior computations like inverse autoregressive flow, it may be possible to compete with or even beat state-of-the-art models. This technique can be used to improve VAE models for other tasks such as semi-supervised learning and conditional density modeling."
Edwards, Storkey - "Towards a Neural Statistician" [http://arxiv.org/abs/1606.02185]
"An efficient learner is one who reuses what they already know to tackle a new problem. For a machine learner, this means understanding the similarities amongst datasets. In order to do this, one must take seriously the idea of working with datasets, rather than datapoints, as the key objects to model. Towards this goal, we demonstrate an extension of a variational autoencoder that can learn a method for computing representations, or statistics, of datasets in an unsupervised fashion. The network is trained to produce statistics that encapsulate a generative model for each dataset. Hence the network enables efficient learning from new datasets for both unsupervised and supervised tasks. We show that we are able to learn statistics that can be used for: clustering datasets, transferring generative models to new datasets, selecting representative samples of datasets and classifying previously unseen classes."
"Our goal was to demonstrate that it is both possible and profitable to work at a level of abstraction of datasets rather than just datapoints. We have shown how it is possible to learn to represent datasets using a statistic network, and that these statistics enable highly flexible and efficient models that can do transfer learning, small shot classification, cluster distributions, summarize datasets and more. Avenues for future research are engineering, methodological and application based. In terms of engineering we believe that there are gains to be had by more thorough exploration of different (larger) architectures. In terms of methodology we want to look at: improved methods of representing uncertainty resulting from sample size; models explicitly designed trained for small-shot classification; supervised and semi-supervised approaches to classifiying either datasets or datapoints within the dataset. One advantage we have yet to explore is that by specifying classes implicitly in terms of sets, we can combine multiple data sources with potentially different labels, or multiple labels. We can also easily train on any unlabelled data because this corresponds to sets of size one. We also want to consider questions such as: What are desirable properties for statistics to have as representations? How can we enforce these? Can we use ideas from classical work on estimators? In terms of applications we are interested in applying this framework to learning embeddings of speakers for speech problems or customer embeddings in commercial problems."
"Potentially a more powerful alternative to Variational Autoencoder."
-- http://techtalks.tv/talks/neural-statistician/63048/ (Edwards)
-- https://youtu.be/XpIDCzwNe78?t=51m53s (Bartunov)
-- http://www.shortscience.org/paper?bibtexKey=journals/corr/1606.02185
Gal, Ghahramani - "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" [http://arxiv.org/abs/1506.02142]
"Deep learning has gained tremendous attention in applied machine learning. However such tools for regression and classification do not capture model uncertainty. Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with a prohibitive computational cost. We show that dropout in neural networks can be cast as a Bayesian approximation. As a direct result we obtain tools to model uncertainty with dropout NNs - extracting information from existing models that has been thrown away so far. This mitigates the problem of representing uncertainty in deep learning without sacrificing computational complexity or test accuracy. We perform an extensive study of the dropout uncertainty properties. Various network architectures and non-linearities are assessed on tasks of regression and classification, using MNIST as an example. We show a considerable improvement in predictive log-likelihood and RMSE compared to existing state-of-the-art methods. We finish by using dropout uncertainty in a Bayesian pipeline, with deep reinforcement learning as a practical task."
"We have built a probabilistic interpretation of dropout which allowed us to obtain model uncertainty out of existing deep learning models. We have studied the properties of this uncertainty in detail, and demonstrated possible applications, interleaving Bayesian models and deep learning models together. This extends on initial research studying dropout from the Bayesian perspective. Bernoulli dropout is only one example of a regularisation technique corresponding to an approximate variational distribution which results in uncertainty estimates. Other variants of dropout follow our interpretation as well and correspond to alternative approximating distributions. These would result in different uncertainty estimates, trading-off uncertainty quality with computational complexity. We explore these in follow-up work. Furthermore, each GP covariance function has a one-to-one correspondence with the combination of both NN non-linearities and weight regularisation. This suggests techniques to select appropriate NN structure and regularisation based on our a-priori assumptions about the data. For example, if one expects the function to be smooth and the uncertainty to increase far from the data, cosine nonlinearities and L2 regularisation might be appropriate. The study of non-linearity–regularisation combinations and the corresponding predictive mean and variance are subject of current research."
"Deep learning has attracted tremendous attention from researchers in fields such as physics, biology, and manufacturing, to name a few. Tools such as the neural network, dropout, convolutional neural networks, and others are used extensively. However, these are fields in which representing model uncertainty is of crucial importance. With the recent shift in many of these fields towards the use of Bayesian uncertainty new needs arise from deep learning tools. Standard deep learning tools for regression and classification do not capture model uncertainty. In classification, predictive probabilities obtained at the end of the pipeline (the softmax output) are often erroneously interpreted as model confidence. A model can be uncertain in its predictions even with a high softmax output. Passing a point estimate of a function through a softmax results in extrapolations with unjustified high confidence for points far from the training data. However, passing the distribution through a softmax better reflects classification uncertainty far from the training data. Model uncertainty is indispensable for the deep learning practitioner as well. With model confidence at hand we can treat uncertain inputs and special cases explicitly. For example, in the case of classification, a model might return a result with high uncertainty. In this case we might decide to pass the input to a human for classification. This can happen in a post office, sorting letters according to their zip code, or in a nuclear power plant with a system responsible for critical infrastructure. Uncertainty is important in reinforcement learning as well. With uncertainty information an agent can decide when to exploit and when to explore its environment. Recent advances in RL have made use of NNs for Q-value function approximation. These are functions that estimate the quality of different actions an agent can make. Epsilon greedy search is often used where the agent selects its best action with some probability and explores otherwise. With uncertainty estimates over the agent’s Q-value function, techniques such as Thompson sampling can be used to learn much faster."
"Bayesian probability theory offers us mathematically grounded tools to reason about model uncertainty, but these usually come with a prohibitive computational cost. It is perhaps surprising then that it is possible to cast recent deep learning tools as Bayesian models – without changing either the models or the optimisation. We show that the use of dropout (and its variants) in NNs can be interpreted as a Bayesian approximation of a well known probabilistic model: the Gaussian process. Dropout is used in many models in deep learning as a way to avoid over-fitting, and our interpretation suggests that dropout approximately integrates over the models’ weights. We develop tools for representing model uncertainty of existing dropout NNs – extracting information that has been thrown away so far. This mitigates the problem of representing model uncertainty in deep learning without sacrificing either computational complexity or test accuracy. In this paper we give a complete theoretical treatment of the link between Gaussian processes and dropout, and develop the tools necessary to represent uncertainty in deep learning. We perform an extensive exploratory assessment of the properties of the uncertainty obtained from dropout NNs and convnets on the tasks of regression and classification. We compare the uncertainty obtained from different model architectures and non-linearities in regression, and show that model uncertainty is indispensable for classification tasks, using MNIST as a concrete example. We then show a considerable improvement in predictive log-likelihood and RMSE compared to existing state-ofthe-art methods. Lastly we give a quantitative assessment of model uncertainty in the setting of reinforcement learning, on a practical task similar to that used in deep reinforcement learning."
"It has long been known that infinitely wide (single hidden layer) NNs with distributions placed over their weights converge to Gaussian processes. This known relation is through a limit argument that does not allow us to translate properties from the Gaussian process to finite NNs easily. Finite NNs with distributions placed over the weights have been studied extensively as Bayesian neural networks. These offer robustness to over-fitting as well, but with challenging inference and additional computational costs. Variational inference has been applied to these models, but with limited success. Recent advances in variational inference introduced new techniques into the field such as sampling-based variational inference and stochastic variational inference. These have been used to obtain new approximations for Bayesian neural networks that perform as well as dropout. However these models come with a prohibitive computational cost. To represent uncertainty, the number of parameters in these models is doubled for the same network size. Further, they require more time to converge and do not improve on existing techniques. Given that good uncertainty estimates can be cheaply obtained from common dropout models, this results in unnecessary additional computation. An alternative approach to variational inference makes use of expectation propagation and has improved considerably in RMSE and uncertainty estimation on VI approaches. In the results section we compare dropout to these approaches and show a significant improvement in both RMSE and uncertainty estimation."
-- http://arxiv.org/abs/1506.02157 (Appendix)
-- http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html (Gal)
-- https://www.evernote.com/shard/s189/sh/0b46fb48-dd1a-4e3b-ac5c-289f4925ff7e/3f0f03231757ded363b42ce71ebfcc70 (Larochelle)
-- https://plus.google.com/u/0/+AnkurHanda/posts/DnXB81efTwa
Herlau, Morup, Schmidt - "Bayesian Dropout" [http://arxiv.org/abs/1508.02905]
"Dropout has recently emerged as a powerful and simple method for training neural networks preventing co-adaptation by stochastically omitting neurons. Dropout is currently not grounded in explicit modelling assumptions which so far has precluded its adoption in Bayesian modelling. Using Bayesian entropic reasoning we show that dropout can be interpreted as optimal inference under constraints. We demonstrate this on an analytically tractable regression model providing a Bayesian interpretation of its mechanism for regularizing and preventing co-adaptation as well as its connection to other Bayesian techniques. We also discuss two general approximate techniques for applying Bayesian dropout for general models, one based on an analytical approximation and the other on stochastic variational techniques. These techniques are then applied to a Baysian logistic regression problem and are shown to improve performance as the model become more misspecified. Our framework roots dropout as a theoretically justified and practical tool for statistical modelling allowing Bayesians to tap into the benefits of dropout training."
"Dropout provides a simple yet powerful tool to avoid co-adaptation in neural networks and has been shown to offer tangible benefits. However, its formulation as an algorithm rather than as a set of probabilistic assumptions precludes its use in Bayesian modelling. We have shown how dropout can be interpreted as optimal inference under a particular constraint. This qualifies dropout beyond being a particular optimization procedure, and has the advantage of giving researchers who want to apply dropout to a particular model a principled way to do so. We have demonstrated Bayesian dropout on an analytically tractable regression model, providing a probabilistic interpretation of its mechanisms for regularizing and preventing co-adaptation as well as its connection to other Bayesian techniques. In our experiments we find that dropout can provide robustness under model misspecification, and offer benefits over ordinary Bayesian linear regression in a real dataset. We also discussed two schemes which allow dropout to be applied in a wider setting. One based on an analytical approximation to the dropout target, the other based on stochastic variational Bayes which, by only requiring an unbiased estimator of the true dropout target, seems nearly ideally suited for dropout. When these techniques were applied to a Bayesian logistic regression problem we found stochastic variational Bayes to have some significant convergence difficulties; notice these were also found for ordinary Bayesian logistic regression without dropout and require further investigation. By increasing the effort of stochastic variational Bayes we arrived at estimates which showed good qualitative agreement with the analytical approximation as evaluated by Hamiltonian Markov chain Monte Carlo. Both approximations showed dropout to have little or no effect in the well-specified regime, however when the number of spurious features were increased dropout led to large increases in performance. In a larger scope, we believe the view that probabilistic modelling may be thought to consist of not only specifying a uniquely optimal model, but as posing general restrictions on the model class provide an important departure from the existent Bayesian paradigm. If this is ultimately true, however, require the method demonstrate its versatility and we see the present formalism of dropout as a single step on this path."
Johnson, Duvenaud, Wiltschko, Datta, Adams - "Structured VAEs: Composing Probabilistic Graphical Models and Variational Autoencoders" [http://arxiv.org/abs/1603.06277]
"We develop a new framework for unsupervised learning that composes probabilistic graphical models with deep learning methods and combines their respective strengths. Our method uses graphical models to express structured probability distributions and recent advances from deep learning to learn flexible feature models and bottom-up recognition networks. All components of these models are learned simultaneously using a single objective, and we develop scalable fitting algorithms that can leverage natural gradient stochastic variational inference, graphical model message passing, and backpropagation with the reparameterization trick. We illustrate this framework with a new structured time series model and an application to mouse behavioral phenotyping."
"Each frame of the video is a depth image of a mouse in a particular pose, and so even though each image is encoded as 30 × 30 = 900 pixels, the data lie near a low-dimensional nonlinear manifold. A good generative model must not only learn this manifold but also represent many other salient aspects of the data. For example, from one frame to the next the corresponding manifold points should be close to one another, and in fact the trajectory along the manifold may follow very structured dynamics. To inform the structure of these dynamics, a natural class of hypotheses used in ethology and neurobiology is that the mouse’s behavior is composed of brief, reused actions, such as darts, rears, and grooming bouts. Therefore a natural representation would include discrete states with each state representing the simple dynamics of a particular primitive action, a representation that would be difficult to encode in an unsupervised recurrent neural network model. These two tasks, of learning the image manifold and learning a structured dynamics model, are complementary: we want to learn the image manifold not just as a set but in terms of manifold coordinates in which the structured dynamics model fits the data well. A similar modeling challenge arises in speech, where high-dimensional data lie near a low-dimensional manifold because they are generated by a physical system with relatively few degrees of freedom but also include the discrete latent dynamical structure of phonemes, words, and grammar."
"Our approach uses graphical models for representing structured probability distributions, and uses ideas from variational autoencoders for learning not only the nonlinear feature manifold but also bottom-up recognition networks to improve inference. Thus our method enables the combination of flexible deep learning feature models with structured Bayesian and even Bayesian nonparametric priors. Our approach yields a single variational inference objective in which all components of the model are learned simultaneously. Furthermore, we develop a scalable fitting algorithm that combines several advances in efficient inference, including stochastic variational inference, graphical model message passing, and backpropagation with the reparameterization trick."
-- http://github.com/mattjj/svae
Papamakarios, Murray - "Fast Epsilon-free Inference of Simulation Models with Bayesian Conditional Density Estimation" [http://arxiv.org/abs/1605.06376]
"Many statistical models can be simulated forwards but have intractable likelihoods. Approximate Bayesian Computation (ABC) methods are used to infer properties of these models from data. Traditionally these methods approximate the posterior over parameters by conditioning on data being inside an Epsilon-ball around the observed data, which is only correct in the limit Epsilon→ 0. Monte Carlo methods can then draw samples from the approximate posterior to approximate predictions or error bars on parameters. These algorithms critically slow down as Epsilon→ 0, and in practice draw samples from a broader distribution than the posterior. We propose a new approach to likelihood-free inference based on Bayesian conditional density estimation. Preliminary inferences based on limited simulation data are used to guide later simulations. In some cases, learning an accurate parametric representation of the entire true posterior distribution requires fewer model simulations than Monte Carlo ABC methods need to produce a single sample from an approximate posterior."
"Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples. A sample-based representation easily gives estimates and error bars of individual parameters, and model predictions. However these computations are noisy, and it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior, but from an approximation based on assuming a pseudo-observation that the data is within an Epsilon-ball centred on the data actually observed. Third, as the Epsilon-tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once. When simulations are expensive to perform, good quality inference becomes impractical. In this paper, we propose an alternative approach to likelihood-free inference, which unlike conventional ABC does not suffer from the above three issues. Instead of returning a set of parameter samples from an approximate posterior, our approach learns a parametric approximation to the exact posterior, which can be made as accurate as required. Furthermore, we present a strategy for learning our parametric approximation by making efficient use of simulations from the model. We show experimentally that our approach is capable of closely approximating the exact posterior, while making efficient use of simulations compared to conventional ABC. Our approach is based on conditional density estimation using Bayesian neural networks, and draws upon advances in density estimation, stochastic variational inference, and recognition networks. To the best of our knowledge, this is the first work that applies such techniques to the field of likelihood-free inference."
-- http://dennisprangle.github.io/research/2016/06/07/bayesian-inference-by-neural-networks + http://dennisprangle.github.io/research/2016/06/07/bayesian-inference-by-neural-networks2
[interesting papers - applications]
Rezende, Eslami, Mohamed, Battaglia, Jaderberg, Heess - "Unsupervised Learning of 3D Structure from Images" [http://arxiv.org/abs/1607.00662]
"A key goal of computer vision is to recover the underlying 3D structure from 2D observations of the world. In this paper we learn strong deep generative models of 3D structures, and recover these structures from 3D and 2D images via probabilistic inference. We demonstrate high-quality samples and report log-likelihoods on several datasets, including ShapeNet, and establish the first benchmarks in the literature. We also show how these models and their inference networks can be trained end-to-end from 2D images. This demonstrates for the first time the feasibility of learning to infer 3D representations of the world in a purely unsupervised manner."
"A key goal of computer vision is that of recovering the underlying 3D structure that gives rise to these 2D observations. The 2D projection of a scene is a complex function of the attributes and positions of the camera, lights and objects that make up the scene. If endowed with 3D understanding, agents can abstract away from this complexity to form stable, disentangled representations, e.g., recognizing that a chair is a chair whether seen from above or from the side, under different lighting conditions, or under partial occlusion. Moreover, such representations would allow agents to determine downstream properties of these elements more easily and with less training, e.g., enabling intuitive physical reasoning about the stability of the chair, planning a path to approach it, or figuring out how best to pick it up or sit on it. Models of 3D representations also have applications in scene completion, denoising, compression and generative virtual reality."
"There have been many attempts at performing this kind of reasoning, dating back to the earliest years of the field. Despite this, progress has been slow for several reasons: First, the task is inherently ill-posed. Objects always appear under self-occlusion, and there are an infinite number of 3D structures that could give rise to a particular 2D observation. The natural way to address this problem is by learning statistical models that recognize which 3D structures are likely and which are not. Second, even when endowed with such a statistical model, inference is intractable. This includes the sub-tasks of mapping image pixels to 3D representations, detecting and establishing correspondences between different images of the same structures, and that of handling the multi-modality of the representations in this 3D space. Third, it is unclear how 3D structures are best represented, e.g., via dense volumes of voxels, via a collection of vertices, edges and faces that define a polyhedral mesh, or some other kind of representation. Finally, ground-truth 3D data is difficult and expensive to collect and therefore datasets have so far been relatively limited in size and scope."
"(a) We design a strong generative model of 3D structures, defined over the space of volumes and meshes, using ideas from state-of-the-art generative models of images.
(b) We show that our models produce high-quality samples, can effectively capture uncertainty and are amenable to probabilistic inference, allowing for applications in 3D generation and simulation. We report log-likelihoods on a dataset of shape primitives, a 3D version of MNIST, and on ShapeNet, which to the best of our knowledge, constitutes the first quantitative benchmark for 3D density modeling.
(c) We show how complex inference tasks, e.g., that of inferring plausible 3D structures given a 2D image, can be achieved using conditional training of the models. We demonstrate that such models recover 3D representations in one forward pass of a neural network and they accurately capture the multi-modality of the posterior.
(d) We explore both volumetric and mesh-based representations of 3D structure. The latter is achieved by flexible inclusion of off-the-shelf renderers such as OpenGL. This allows us to build in further knowledge of the rendering process, e.g., how light bounces off surfaces and interacts with their material attributes.
(e) We show how the aforementioned models and inference networks can be trained end-to-end directly from 2D images without any use of ground-truth 3D labels. This demonstrates for the first time the feasibility of learning to infer 3D representations of the world in a purely unsupervised manner."
"In this paper we introduced a powerful family of 3D generative models inspired by recent advances in image modeling. We showed that when trained on ground-truth volumes, they can produce high-quality samples that capture the multi-modality of the data. We further showed how common inference tasks, such as that of inferring a posterior over 3D structures given a 2D image, can be performed efficiently via conditional training. We also demonstrated end-to-end training of such models directly from 2D images through the use of differentiable renderers. We experimented with two kinds of 3D representations: volumes and meshes. Volumes are flexible and can capture a diverse range of structures, however they introduce modeling and computational challenges due to their high dimensionality. Conversely, meshes can be much lower dimensional and therefore easier to work with, and they are the data-type of choice for common rendering engines, however standard paramaterizations can be restrictive in the range of shapes they can capture."
-- https://youtube.com/watch?v=stvDAGQwL5c + https://goo.gl/9hCkxs (demos)
-- https://docs.google.com/presentation/d/12uZQ_Vbvt3tzQYhWR3BexqOzbZ-8AeT_jZjuuYjPJiY/pub?start=true&loop=true&delayms=30000#slide=id.g1329951dde_0_0 (demos)
Lake, Salakhutdinov, Tenenbaum - "Human-level Concept Learning Through Probabilistic Program Induction" [http://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8.pdf]
"People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms - for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world’s alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several “visual Turing tests” probing the model’s creative generalization abilities, which in many cases are indistinguishable from human behavior."
--
"Vision program outperformed humans in identifying handwritten characters, given single training example"
"This work brings together three key ideas -- compositionality, causality, and learning-to-learn --- challenging (in a good way) the traditional deep learning approach"
-- http://youtube.com/watch?v=kzl8Bn4VtR8 (Lake)
-- http://youtu.be/quPN7Hpk014?t=21m5s (Tenenbaum)
-- http://techtalks.tv/talks/one-shot-learning-of-simple-fractal-concepts/63049/ (Lake)
-- https://reddit.com/r/MachineLearning/comments/3x4ml0/with_the_focus_on_deep_learning_what_problems/cy1nsuh
-- https://github.com/brendenlake/BPL
-- http://cims.nyu.edu/~brenden/supplemental/turingtests/turingtests.html
Herbrich, Minka, Graepel - "TrueSkill(TM): A Bayesian Skill Rating System" [http://research.microsoft.com/apps/pubs/default.aspx?id=67956]
"We present a new Bayesian skill rating system which can be viewed as a generalisation of the Elo system used in Chess. The new system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. Inference is performed by approximate message passing on a factor graph representation of the model. We present experimental evidence on the increased accuracy and convergence speed of the system compared to Elo and report on our experience with the new rating system running in a large-scale commercial online gaming service under the name of TrueSkill."
-- http://moserware.com/2010/03/computing-your-skill.html
Stern, Herbrich, Graepel - "Matchbox: Large Scale Bayesian Recommendations" [http://research.microsoft.com/apps/pubs/default.aspx?id=79460]
"We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/ don’t like) and observation of a set of ordinal ratings on a userspecific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation and Variational Message Passing. We also include a dynamics model which allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift over time. By using Assumed-Density Filtering for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data."
Kumar, Tomkins, Vassilvitskii, Vee - "Inverting a Steady-State" [http://theory.stanford.edu/~sergei/papers/wsdm15-cset.pdf]
"We consider the problem of inferring choices made by users based only on aggregate data containing the relative popularity of each item. We propose a framework that models the problem as that of inferring a Markov chain given a stationary distribution. Formally, we are given a graph and a target steady-state distribution on its nodes. We are also given a mapping from per-node scores to a transition matrix, from a broad family of such mappings. The goal is to set the scores of each node such that the resulting transition matrix induces the desired steady state. We prove sufficient conditions under which this problem is feasible and, for the feasible instances, obtain a simple algorithm for a generic version of the problem. This iterative algorithm provably finds the unique solution to this problem and has a polynomial rate of convergence; in practice we find that the algorithm converges after fewer than ten iterations. We then apply this framework to choice problems in online settings and show that our algorithm is able to explain the observed data and predict the user choices much better than other competing baselines across a variety of diverse datasets."
Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, Torr - "Conditional Random Fields as Recurrent Neural Networks" [http://www.robots.ox.ac.uk/~szheng/papers/CRFasRNN.pdf]
"Pixel-level labelling tasks, such as semantic segmentation, play a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning techniques for image recognition to tackle pixellevel labelling tasks. One central issue in this methodology is the limited capacity of deep learning techniques to delineate visual objects. To solve this problem, we introduce a new form of convolutional neural network that combines the strengths of Convolutional Neural Networks and Conditional Random Fields -based probabilistic graphical modelling. To this end, we formulate Conditional Random Fields as Recurrent Neural Networks. This network, called CRF-RNN, is then plugged in as a part of a CNN to obtain a deep network that has desirable properties of both CNNs and CRFs. Importantly, our system fully integrates CRF modelling with CNNs, making it possible to train the whole deep network end-to-end with the usual back-propagation algorithm, avoiding offline postprocessing methods for object delineation. We apply the proposed method to the problem of semantic image segmentation, obtaining top results on the challenging Pascal VOC 2012 segmentation benchmark."
-- http://www.robots.ox.ac.uk/~szheng/crfasrnndemo
Huang, Murphy - "Efficient Inference in Occlusion-aware Generative Models of Images" [http://arxiv.org/abs/1511.06362]
"We present a generative model of images based on layering, in which image layers are individually generated, then composited from front to back. We are thus able to factor the appearance of an image into the appearance of individual objects within the image --- and additionally for each individual object, we can factor content from pose. Unlike prior work on layered models, we learn a shape prior for each object/layer, allowing the model to tease out which object is in front by looking for a consistent shape, without needing access to motion cues or any labeled data. We show that ordinary stochastic gradient variational bayes, which optimizes our fully differentiable lower-bound on the log-likelihood, is sufficient to learn an interpretable representation of images. Finally we present experiments demonstrating the effectiveness of the model for inferring foreground and background objects in images."
"We have shown how to combine an old idea - of interpretable, generative, layered models of images - with modern techniques of deep learning, in order to tackle the challenging problem of intepreting images in the presence of occlusion in an entirely unsupervised fashion. We see this is as a crucial stepping stone to future work on deeper scene understanding, going beyond simple feedforward supervised prediction problems. In the future, we would like to apply our approach to real images, and possibly video. This will require extending our methods to use convolutional networks, and may also require some weak supervision (e.g., in the form of observed object class labels associated with layers) or curriculum learning to simplify the learning task."
Wilson, Dann, Lucas, Xing - "The Human Kernel" [http://arxiv.org/abs/1510.07389]
"Bayesian nonparametric models, such as Gaussian processes, provide a compelling framework for automatic statistical modelling: these models have a high degree of flexibility, and automatically calibrated complexity. However, automating human expertise remains elusive; for example, Gaussian processes with standard kernels struggle on function extrapolation problems that are trivial for human learners. In this paper, we create function extrapolation problems and acquire human responses, and then design a kernel learning framework to reverse engineer the inductive biases of human learners across a set of behavioral experiments. We use the learned kernels to gain psychological insights and to extrapolate in humanlike ways that go beyond traditional stationary and polynomial kernels. Finally, we investigate Occam’s razor in human and Gaussian process based function learning."
"We have shown that (1) human learners have systematic expectations about smooth functions that deviate from the inductive biases inherent in the RBF kernels that have been used in past models of function learning; (2) it is possible to extract kernels that reproduce qualitative features of human inductive biases, including the variable sawtooth and step patterns; (3) that human learners favour smoother or simpler functions, even in comparison to GP models that tend to over-penalize complexity; and (4) that it is possible to build models that extrapolate in human-like ways which go beyond traditional stationary and polynomial kernels."
"We have focused on human extrapolation from noise-free nonparametric relationships. This approach complements past work emphasizing simple parametric functions and the role of noise, but kernel learning might also be applied in these other settings. In particular, iterated learning experiments provide a way to draw samples that reflect human learners’ a priori expectations. Like most function learning experiments, past IL experiments have presented learners with sequential data. Our approach, following Little and Shiffrin, instead presents learners with plots of functions. This method is useful in reducing the effects of memory limitations and other sources of noise (e.g., in perception). It is possible that people show different inductive biases across these two presentation modes. Future work, using multiple presentation formats with the same underlying relationships, will help resolve these questions. Finally, the ideas discussed in this paper could be applied more generally, to discover interpretable properties of unknown models from their predictions. Here one encounters fascinating questions at the intersection of active learning, experimental design, and information theory."
-- http://research.microsoft.com/apps/video/default.aspx?id=259610 (Wilson, 11:30)
-- http://functionlearning.com
Kucukelbir, Ranganath, Gelman, Blei - "Automatic Variational Inference in Stan" [http://arxiv.org/abs/1506.03431]
"Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult to automate. We propose an automatic variational inference algorithm, automatic differentiation variational inference. The user only provides a Bayesian model and a dataset; nothing else. We make no conjugacy assumptions and support a broad class of models. The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We implement ADVI in Stan (code available now), a probabilistic programming framework. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan."
"We develop automatic differentiation variational inference in Stan. ASVI leverages automatic transformations, an implicit non-Gaussian variational approximation, and automatic differentiation. This is a valuable tool. We can explore many models, and analyze large datasets with ease."
-- http://research.microsoft.com/apps/video/default.aspx?id=259601 (Kucukelbir, 18:30)
Kucukelbir, Tran, Ranganath, Gelman, Blei - "Automatic Differentiation Variational Inference" [http://arxiv.org/abs/1603.00788]
"Probabilistic modeling is iterative. A scientist posits a simple model, fits it to her data, refines it according to her analysis, and repeats. However, fitting complex models to large data is a bottleneck in this process. Deriving algorithms for new models can be both mathematically and computationally challenging, which makes it difficult to efficiently cycle through the steps. To this end, we develop automatic differentiation variational inference. Using our method, the scientist only provides a probabilistic model and a dataset, nothing else. ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models. ADVI supports a broad class of models - no conjugacy assumptions are required. We study ADVI across ten different models and apply it to a dataset with millions of observations. ADVI is integrated into Stan, a probabilistic programming system; it is available for immediate use."
<brylevkirill (at) gmail.com>