Todo
Rewrite finite sets policy part
Rewrite P(A|B) derivation.
Notes on notation
iiuc - A bit unsure (like below 80%), but if I understand correctly...
mbg - Quite unsure, but my best guess would be something like...
tbh idk/tbh idu - To be honest I don't know/didn't understand...
Probability Theory - The Logic of Science by E.T. Jaynes
Why is this book good?
E.Y. says it's sexy.
"Preface"
Why might You be interested in this book?
Jaynes says:
Book describes inference.
The prob theory explained in this book is more practical than classic
statistics.
The book resolves many known problems in classic statistics.
The book contains new results which would be hard to prove using
classic statistics.
"History"
What was the point of this subchapter?
Answering the question -
Why should You trust Jaynes?
The book actually contains useful results.
The resulting theory gives pretty much the same results as classic
stats, where it works.
Many struggles that classic stats has are explained and solved using the
book's prob theory.
Many Bayesian struggles of the past are explained and solved in the
book.
The book contains new useful results hard or impossible to prove in
classic stats.
The book is quite rigorous in a practical sense.
It starts off with some reasonable rules/axioms about how
common sense and acting on probabilities should ideally work.
@Though some of them don't seem obvious to me.
Then proves that there's only one way of thinking and doing prob
theory, based on the axioms.
And notes where it's using approximations or assumptions, which
might cause confusion or paradoxes if not noted.
Jaynes has spent like 40 years giving lectures and thinking about
the details and consequences.
Why are the rules/axioms Jaynes accepts trustworthy?
The resulting theory works like classic stats,
when the frequency interpretation of possibility makes sense.
But the theory also gives useful results much less awkwardly in
situations where
interpreting possibility as
You having partial knowledge about the truth
makes more sense than
interpreting it as a frequency.
e.g. so far most obviously
Choosing between two hypotheses.
As to Jaynes's official explanation iiuc:
G. Pólya showed that there were some strong common sense rules as
to how the possibility of something seems to change.
He didn't know how to describe it with numbers.
He just had rules for what makes something more or less possible.
Possibility here is intended to mean something like -
How likely is something to be true?
e.g.
What's the possibility that this man is guilty of a crime?
What's the possibility of the next coinflip being heads?
and not -
What's the proportion of positive cases against the total cases?
(Which is the interpretation in classic stats)
e.g.
What's the proportion/frequency of throwing heads in coinflips?
What's the proportion/frequency of throwing double sixes
in Monopoly?
In addition to G. Pólya's rules,
R. T. Cox made up some rules about consistency of assigning and
acting on probabilities.
He then proved that any way of thinking and acting on
probabilities is isomorphic to the way it is done in this book,
or it breaks one of Pólya's common sense rules, or Cox's rational
consistency rules.
These rules and the proof are described in chap 1 and 2.
What does isomorphic mean?
Roughly?
Two things are called isomorphic,
if You can prove that they work in the same way, but have
different names assigned to stuff.
Then if You prove something is true in one thing,
by analogy it's true in the other thing.
@I'm a bit unsure about what isomorphism means in the book.
@It can mean a lot of things based on what stuff can change
and what has to be left constant.
Slightly more formal example of how isomorphism could be defined?
I'll call a "thing" a set and functions defined on a set.
Two things are isomorphic if You can rename the elements of the
sets so that both things become equal.
Name collisions are not allowed
e.g.
Thing 1 - set S = {A, B}, function f:{A -> B, B -> A}
Thing 2 - set S2 = {0, 1}, function g:{0 -> 1, 1 -> 0}
If You rename
A -> 0
B -> 1
f -> g
S -> S2
Then thing 1 becomes thing 2.
Thing 1 - set S = R, function + (The plus sign that we all
know and love)
Thing 2 - set S2 = R+, function * (The multiplication sign
we all know and love)
If we take thing 1
and rename
All numbers x -> e ^ x
+ -> *
S -> S2
Then thing 1 becomes thing 2.
anti e.g.
Thing 1 - set S = {0}
Thing 2 - set S2 = {0, 1}
Since name collisions aren't allowed, You can't give two
numbers the same name.
Thing 1 - set S = {0, 1}, f:{0 -> 0, 1 -> 1}
Thing 2 - set S2 = {0, 1}, g:{0 -> 1, 1 -> 0}
I can't think of a way to rename them.
One way to see they're not isomorphic: f has fixed points
(f(0) = 0), g has none, and renaming preserves fixed points.
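@A minimal Python sketch (mine, not from the book) checking the second
example numerically: the renaming x -> e^x turns + on R into * on R+.
    # rename(x + y) should equal rename(x) * rename(y).
    import math

    def rename(x):
        return math.exp(x)

    for x, y in [(0.0, 1.0), (2.5, -3.0), (10.0, 0.1)]:
        assert math.isclose(rename(x + y), rename(x) * rename(y))
    print("renaming x -> e^x maps + to *")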
"Foundations"
What was the point of this chapter?
iiuc
Trying to explain where Jaynes got the main ideas used to guide
deriving all the stuff.
What are the most foundational ideas used for deriving Jaynes's prob
theory?
Probability theory as an extension of logic
What's Probability theory as an extension of logic?
iiuc
The idea that prob theory should be able to assign probabilities
to systems of logical statements and observations about those
systems.
Why does Jaynes call it that?
Because it extends the possible values of logic from false
and true, to a bunch of possibility values between.
Why is that important?
In classic prob theory, people start with random variables
defined on sets of events.
iiuc
That makes it awkward to assign probabilities to systems
of logical statements, even though doing so is useful in practice.
Pólya's and Cox's axioms/rules
The finite sets policy
What's the finite sets policy?
The idea that infinite sets shouldn't have properties assigned
to them, at least in probability theory.
The way Jaynes proposes modeling things with infinite sets, is to
always define a limiting process of finite systems, and see whether
the properties of interest converge/work.
What does limiting process mean?
You just look at larger and larger systems constructed by some
algorithm, and see whether a property of the system gets closer
and closer to some value.
Why?
Jaynes says the classic approach of assigning properties to
infinite sets causes more paradoxes than always defining a
limiting process.
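@A toy Python example (mine, not from the book) of a limiting process:
instead of asking "what fraction of all positive integers is even?"
directly, look at the first n integers and let n grow.
    def fraction_even(n):
        return sum(1 for k in range(1, n + 1) if k % 2 == 0) / n

    for n in [10, 100, 1000, 10000]:
        print(n, fraction_even(n))
    # The values settle on 1/2, which is the answer the limiting
    # process assigns to the infinite question.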
Some comparisons
The Kolmogorov axioms of probability theory can be roughly derived
from Pólya's and Cox's work. At least where the axioms are applicable.
de Finetti's prob theory differs in how it handles infinite sets, which
causes paradoxes.
"Comparisons"
What was the point of this chapter?
iiuc
Explain how prob theory ??philosophies compare when looking at the
problems they can solve.
@I might want to rewrite these notes once I understand better.
What philosophies are discussed?
Frequentism
Bayesianism
Probability as an extension of logic
Maximum entropy
By "better" I mean "more effectively" here.
How do the philosophies overlap in the problems they can solve?
iiuc
Frequentism is the most narrow.
Bayesianism solves many problems better than frequentism.
I suspect that Bayesianism includes frequentism where it works,
though that wasn't clearly stated, so it might not be true.
Prob theory as an extension of logic solves all problems
frequentism and
Bayesianism can solve.
It also solves some problems they can't.
iiuc
Maximum entropy can solve
all the problems Bayesianism can,
and something extra.
But Bayesianism solves them better than maximum entropy, if the
problem is "well developed".
What does "well developed" mean?
You have to know the problem's
model,
sample space,
hypothesis space,
prior probabilities.
sampling distribution.
@What does that even mean?
tbh idk.
tbh idu how
maximum entropy and
prob theory as an extension of logic
overlap.
Limitations of philosophies?
Frequentism requires that
The problem is interpretable as a repeatable random experiment,
which is rarely true.
iiuc
Other philosophies don't have this limitation.
Frequentism works badly when
Prior information is important.
What does that mean?
mbg
Prior information means something like assumptions you know
to be true about the problem.
iiuc
Other philosophies don't have this limitation as much.
Frequentist statistics don't use all the data optimally if a good
enough statistic doesn't exist.
In Bayesianism, it's pretty straightforward how to use the data
optimally.
In practice Bayesianism can solve harder problems than frequentism.
Some important tools used in "frequentist statistics" require
extra assumptions that don't come from the axioms, and they break
down at the extremes.
e.g.
Unbiased estimators
Tail-area hypothesis tests
Confidence intervals
iiuc
Other philosophies, but not frequentism, allow
"elimination of nuisance parameters".
What's that?
iiuc
Some ways of simplifying the calculations, if You're only
interested in some information about the problem, but not all.
To use Bayesianism a problem has to be "well developed".
@Does frequentism require this?
tbh idk.
Maximum entropy doesn't require knowing the model and sampling
distribution.
@Limitations of prob theory as an extension of logic?
I suspect similar to Bayesianism, though that wasn't clearly
stated.
To use maximum entropy, You have to know a problem's
sample space,
hypothesis space,
prior probabilities.
Jaynes suspects there might exist principles which don't even require
that.
@Are there ways in which frequentism is better than Bayesianism?
tbh idu.
Jaynes proposes reading these books for examples of ??maximum entropy:
Bayesian Spectrum Analysis and Parameter Estimation by Bretthorst,
Maximum Entropy in Action by Buck and Macaulay
Data Analysis - A Bayesian Tutorial by Sivia
"Mental Activity"
In many cases prob theory can describe how people reason.
In some cases the connection might seem surprising or even disturbing.
Jaynes thinks it has potential for psychological or legal research.
"What is 'safe'?"
What was the point of this chapter?
mbg
Showing that incorrect prior information baked into models and
methods of analysis is important, and can lead to incorrect
results.
If the prior information is wrong, it's possible that no
amount of data will bring correct results.
Example?
People assume a linear model between how much a substance is
eaten and how toxic it is.
But that model is wrong, because many substances have thresholds
up to which they're not dangerous at all.
So testing extremely large doses can lead to overestimating
danger for normal doses,
and testing extremely small doses can lead to underestimating
the danger for normal doses.
iiuc
A more philosophy-level example of incorrect prior information:
You can't derive Newtonian mechanics by modeling a coin with a
stochastic model no matter how much data You get.
Jaynes thinks these ideas are not taken seriously enough in medical
research, and this is very dangerous.
"Style of Presentation"
@Note Draft
What was the point of this chapter?
Explaining various details relating to
Structure of Jaynes's explanations,
Jaynes's views relating to mathematical rigor and practicality,
Roasting classic frequentist methods a little.
Structure of the book?
Part 1 contains:
Explanations of the principles and axioms.
Explanations for how to apply them.
Explanations of where historically people have messed up.
Part 2 contains:
Just explanations of advanced applications.
Jaynes will usually
First explain the problem, and how people have failed in the past.
Then work out a couple examples.
Opinions on how to explain axioms and principles?
Jaynes pays more attention to explaining the connection between the
world and the principles and axioms, than he does to applications.
Why?
Because in his experience students have trouble understanding the
connection to the real world,
but no trouble generalizing to new problems once they do.
iiuc
When explaining the principles and axioms
Jaynes cares more about clarity and understandability than rigor.
Because
Rigor is useless if the connection to the real world is not
well understood.
Jaynes will be very strict with his derivations.
Everything will be a result of the axioms and rules.
And I suppose it will be clearly stated when extra assumptions or
approximations are made.
Unorthodox opinions
Jaynes thinks mathematics creates wrong results because of how it
handles infinite sets, so rigor doesn't even necessarily lead to
correct results when working with infinite sets.
Jaynes will use continuous approximations for finite systems.
But he doesn't think "showing how to generate an uncountable set as a
limit of a finite one" is important.
@What does that even mean?
tbh idk.
Jaynes will not do statistics as is done in frequentism,
instead always using likelihood functions.
Jaynes's justifications for doing this:
iiuc
the classic frequentist methods for working with statistics don't
come as a result of the axioms, and thus cause "paradoxes".
The likelihood function comes as a result of the axioms.
As a result the likelihood function works much better
(see the sketch after this list):
The likelihood function perfectly includes all the information in
the data, while usually classic statistics don't.
It is always clear how to write down a likelihood function, but not
always clear how to reduce the problem to a good statistic.
This allows for much harder problems to get solved.
likelihood function calculations often allow for simplifications to
be done. (nuisance parameter elimination)
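@A minimal sketch (mine, numbers made up) of what a likelihood function
looks like for a toy problem - coin flips with unknown heads
probability theta:
    # L(theta) = theta^heads * (1 - theta)^tails for the observed flips.
    # Note it depends on the data only through (heads, tails).
    data = [1, 0, 1, 1, 0, 1, 1, 1]            # made-up flips, 1 = heads
    heads, tails = sum(data), len(data) - sum(data)

    def likelihood(theta):
        return theta ** heads * (1 - theta) ** tails

    for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
        print(theta, likelihood(theta))
    # Multiplying by a prior over theta and normalizing would give a
    # posterior; no separate "statistic" has to be invented.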
Jaynes tries to explain everything as understandably as possible,
that includes:
Not using any linguistic tricks or hidden meanings.
Explaining everything in plain English.
iiuc
Not explaining things that people understand intuitively
exceptionally well and can't really reduce much further, like what
does it mean for something to be true or exist.
Part 1. Principles and elementary applications
1. Plausible reasoning
1.1 Deductive and plausible reasoning
What are we currently trying to understand at all?
What does common sense mean?
And how to generalize it?
What types of statements does math logic deal with?
If
A => B and
A = True
Then
B = True
What types of statements do people usually use when thinking irl?
If
A => B and
B = True
Then
A becomes more probable
Or
If
A makes B more probable and
B = True
Then
A becomes more probable
Example?
You live in the same room as Your little brother.
Occasionally "Your brother drinks juice in Your room".
You come home, and see "some juice spilled on the floor".
"Your brother drinks juice in Your room" makes
"some juice spilled on the floor" more likely so
it becomes more likely "Your brother drank juice in Your room".
Even when in reality a burglar might have broken into the house,
drank some juice, spilled some on the floor, and now is hiding in
Your wardrobe, because You came home too early.
Does prior information affect reasoning?
Yes.
Example?
If every time You find "some juice is spilled on the floor", a
burglar turns out to have broken into Your house, You wouldn't be
so harsh on Your little brother.
In practice even mathematicians always use the probabilistic style of
thinking when
figuring out what conjectures/theorems are likely to be true and
worth thinking about.
How do we know that?
Figuring out formal proofs is usually delayed to publication
writing time.
Is mathematical implication causation?
Nope.
How come?
It rains => It was cloudy 1 second ago
but rain is not the cause of clouds
Concrete connection between implication and causation?
Causation implies implication, but
implication doesn't imply causation.
1.2 Analogies with physical theories
How do physicists do things in their field?
Approximate procedure
Someone finds a pattern in some features of the world
They write up an idealized model
It seems to successfully predict something about these features
This is considered progress in the field
Then use newfound knowledge to find larger patterns
How does this relate to understanding common sense?
Jaynes claims he'll take a similar approach.
Finding small patterns in how probabilistic reasoning seems to
work.
Then writing more general models.
Why create a model at all?
Creating an abstract model of probabilistic reasoning would be
useful in cases
which are too complicated to be handled by human reasoning
capabilities.
Examples?
Many propositions or variables.
Situations where emotions come into play.
Situations where precise estimates are necessary.
etc.
1.3 The thinking computer
Psychologically it's better to think about
how to build a thinking computer
rather than how to model human common sense
because
it's hard to think about human common sense without becoming
philosophical and involved in debates.
Because human common sense includes a lot of
biases and inconsistencies
Because the human mind hasn't evolved just for good reasoning.
1.4 Introducing the robot
Assume we have a robot who can do math logic.
1.5 Boolean algebra
What's AB mean in bool alg in this book?
Both A and B are true
What's A + B ...?
A or B are true (not xor)
What's A = B ...?
A and B have the same truth value
0th reasonable axiom of possible reasoning?
If two statements are mathematically equivalent, their possibilities
are equal.
What's \overline{A} ...?
Not A
What's A => B ...?
If A is true then B is also true
How does possible reasoning improve on mathematical logic?
In math if A => B, then we don't get any info about
B if we find out A is false, or
A if we find out B is true.
However we know that irl there is some information there.
That's what a possible reasoning model could help with.
Note
Why is "implication" a bad name for "=>"?
"A implies B" is sometimes interpreted as "B can be deduced from A"
In math "A implies B" just means "If A is true then B is true".
e.g.
2 + 2 = 5 => I am the pope, is a mathematically true statement. And
2 + 2 = 5 => I am not the pope is also a true statement.
Because
False => False is correct in math
False => True is correct in math
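@A small Python check (mine) that "A => B" in the math sense is just
"Not A Or B", so it comes out true whenever A is false:
    def implies(a, b):
        return (not a) or b

    for a in (False, True):
        for b in (False, True):
            print(a, "=>", b, "is", implies(a, b))
    # False => False and False => True both print True, which is why
    # "2 + 2 = 5 => I am the pope" is a true statement.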
1.6 Adequate sets of operations
Assume we know math logic.
1.7 The basic desiderata
What's desiderata?
Useful quality/property that we would like to have.
The author insists on using this word.
What does Jaynes define A|B to mean?
Possibility of A given that B is true.
Corresponds to a real number.
A|B is defined only if B is possible.
Why exactly this definition?
Usually we know something about A.
We don't reason about abstract symbols we know nothing about.
It's still possible to write down the possibility of an event given
no observation by writing A|True.
Does A|False exist?
No. It's not defined.
What if A or B is a number?
A and B have to be bool statements.
Clear enough that they work with bool logic.
A or B can be statements like
"x has a value a", which is usually meant by authors who write
numbers instead of statements.
Jaynes says he'll describe how not doing this actually leads to
paradoxes in practical situations in chap 15.
Properties we'd like to have for possible reasoning?
I Degrees of possibility are represented by real numbers.
Why?
Jaynes couldn't think of a system without this or a property
equivalent in practice.
There is also some theoretical requirement or something.
The directions of updates correspond to common sense.
A bigger possibility will correspond to a bigger number.
Why?
Convenience, but not necessary
A small increase in possibility will cause only a small increase
in the number. (i.e. continuity)
Why?
Convenience, but not necessary
How does Jaynes understand the word "update"?
The moment when we understand that the relevant information for
calculating the possibility of statement A is C' instead of C.
i.e. We calculate A|C' instead of A|C
Usually because C' = C And "some new statement".
If updating C makes B more possible
i.e. (B|C') > (B|C),
Then (Not B|C') < (Not B|C),
i.e. The opposite of B becomes less possible.
Why?
Seems quite reasonable
If
updating C makes B more possible,
i.e. (B|C') > (B|C),
and the update doesn't change the possibility of A given B,
i.e. (A|BC') = (A|BC),
then
the information should increase the possibility of (A and B),
i.e. (AB|C') >= (AB|C)
Why?
After thinking of a couple of examples it seems reasonable
Consistency
IIIa
Any way of calculating the result leads to the same result.
IIIb
Prob theory doesn't guarantee working if not all relevant knowledge
is put into the condition.
IIIc
If in two problems the relations are the same except for
labeling, any derived results should be the same.
Why these properties?
Author claims that
these properties ~uniquely define a way to handle possibilities.
Assume our robot follows these rules.
2. The quantitative rules
2.1 The product rule
What variables go into calculating the possibility of AB|C?
AB|C = F(B|C, A|BC) or F(A|C, B|AC)
Why?
It seems reasonable that the possibility of AB|C can be split
into two parts:
Calculating the possibility of
B given C
and then taking into account how possible it is we will see
A given B and C happened.
Also there was an exhaustive proof by a guy named Cox that showed
other options would absurdly contradict intuition in everyday
situations.
When an update C -> C' happens such that
B|C' > B|C and
A|BC' = A|BC and thus
AB|C' >= AB|C
When does equality happen?
When A|BC' = A|BC = impossible.
Why?
Not explained really, but seems intuitive.
When an update C -> C' happens such that
B|C' = B|C and
A|BC' > A|BC and thus
AB|C' >= AB|C
When does equality happen?
When B|C' = B|C = impossible.
Why?
Not explained really, but seems intuitive.
F is continuous and monotonically increasing in both arguments.
why?
Didn't really understand, but seems reasonable.
Except F(x, y) is impossible if x or y is impossible.
F(x, F(y, z)) = F(F(x, y), z)
Why?
(ABC|D) = F(BC|D, A|BCD) = F(F(C|D, B|CD), A|BCD)
and also (ABC|D) = F(C|D, AB|CD) = F(C|D, F(B|CD, A|BCD))
Why is this important?
All possible solutions to this equation, given properties of F,
are known.
What are the solutions?
F(B|C, A|BC) = inv(w)(w(B|C) * w(A|BC))
w - positive & monotonic function
Why?
There was an advanced proof using differential equations and
math anal.
But I didn't really understand it.
Name?
Product Rule
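@A small numeric check (mine) that defining F through any positive
monotonic w gives an F satisfying F(x, F(y, z)) = F(F(x, y), z);
the choice w(x) = x^3 is arbitrary, just for illustration:
    import math, random

    def w(x):
        return x ** 3

    def w_inv(x):
        return x ** (1.0 / 3.0)

    def F(x, y):          # F(B|C, A|BC) = inv(w)(w(B|C) * w(A|BC))
        return w_inv(w(x) * w(y))

    random.seed(0)
    for _ in range(1000):
        x, y, z = random.random(), random.random(), random.random()
        assert math.isclose(F(x, F(y, z)), F(F(x, y), z))
    print("associativity holds numerically")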
What's NCC?
For shortness, that's what I call the number corresponding to certainty.
e.g.
(A or Not A|C),
(True|C),
etc.
w(NCC) = 1
why?
Since equivalent logical statements have the same possibility
(0) Take C => A, and thus A|C = NCC, then
Because
(C => A) => A = True
True And B = B
(1) AB|C = B|C and
Because
B And C = True => (C = True) and
C => A
(2) A|BC = NCC = A|C
AB|C =
= F(B|C, A|BC) = (From 2)
= F(B|C, A|C)
(3) i.e. AB|C = F(B|C, A|C)
From 3 and Product Rule.
w(AB|C) = w(B|C)w(A|C) (Changing LHS from 1)
w(B|C) = w(B|C)w(A|C)
w(A|C) = 1 (Changing LHS from 0)
w(NCC) = 1
Q.E.D.
What's NCI?
-||- Number corresponding to impossibility.
e.g.
(False|C),
(A And Not A|C),
etc.
w(NCI) = 0 or +inf
why?
Similar to proof for why w(NCC) = 1
What happens if +inf?
If one was to assume w(NCI) = +inf instead of w(NCI) = 0,
that later turns out to work sort of equivalently.
We'll assume w(NCI) = 0, as is traditional.
2.2 The sum rule
What's the algebraic relation between w(A|B) and w(Not A|B)?
w(A|B)^m + w(Not A|B)^m = 1, for some m > 0.
The solutions for different m are all again sort of equivalent
(m can be absorbed into w).
But because the formulas look simpler,
and by tradition, we'll take m = 1:
w(A|B) = 1 - w(Not A|B)
name?
Negation rule.
why?
Algebraic, differential and math anal magic, didn't comprehend.
What was all of this fucking around for?
It was sort of proved that,
No matter what the mapping from possibilities to real numbers, and
as long as it obeys the desiderata,
there will exist a function w that maps from the possibility
associated real numbers to [0, 1],
Such that the w's can be calculated regardless of the
possibility -> R association. (Which is ~explained later in notes)
I'm really not sure if this doesn't cause inconsistencies, like
mapping one w value to different possibilities, but it's not the
first thing I don't understand.
Also w has the properties:
w(number corresponding to True = certain) = 1
w(number corresponding to False = impossible) = 0
w(AB|C) = w(B|C) * w(A|BC)
w(A|B) = 1 - w(Not A|B)
If two statements A and B play symmetric roles given C, then
w(A|C) = w(B|C)
Also, calculations with this w(A|B) function in practice work
analogously to how people traditionally do calculations with P(A|B).
Except w(A|B) works with math statements and P(A|B) with sets.
Thus as is traditional we'll call the w function P(A|B).
How to calculate P(A + B|C)?
P(A + B|C) = P(A|C) + P(B|C) - P(AB|C)
why?
Algebraically derivable from negation and product rule.
What's mutually exclusive mean?
Two of the events cannot be true at the same time.
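@A small check (mine, example made up) of the rule on one fair die
roll, using exact fractions:
    from fractions import Fraction

    outcomes = range(1, 7)                   # C: one roll of a fair die
    def P(event):                            # uniform probability given C
        return Fraction(sum(1 for x in outcomes if event(x)), 6)

    A = lambda x: x % 2 == 0                 # "the roll is even"
    B = lambda x: x <= 3                     # "the roll is at most 3"
    lhs = P(lambda x: A(x) or B(x))          # P(A + B|C)
    rhs = P(A) + P(B) - P(lambda x: A(x) and B(x))
    assert lhs == rhs == Fraction(5, 6)
    # For mutually exclusive statements P(AB|C) = 0, so the rule
    # reduces to P(A + B|C) = P(A|C) + P(B|C).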
Given all P(subset(Ai)|X), can You calculate any P(f(Ai)|f2(Ai)X)?
Yes.
How?
Worst case:
P(f(Ai)|f2(Ai)X) = (Product rule)
= P(f(Ai)f2(Ai)|X) / P(f2(Ai)|X)
Now the problem has been reduced to calculating two probabilities of
form P(f(Ai)|X).
Next calculate probabilities of all possible combinations of Ais,
e.g. P(A1 And Not A2 And A3 And A4 And Not A5...|X),
(Which is possible using the product and negation rules).
Next write f(Ai) as a sum of all the relevant combinations of Ai.
In practice, usually there are other ways of getting the result faster
and without knowing all the P(subset(Ai)|X) values.
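@A minimal sketch (mine, atom probabilities made up) of the worst-case
recipe: store the probability of every combination of A1..A3, then any
P(f(Ai)|f2(Ai)X) is a sum of atoms divided by a sum of atoms:
    from itertools import product

    # Made-up P(+-A1 +-A2 +-A3|X), one number per truth assignment.
    atoms = {bits: 0.125 for bits in product([False, True], repeat=3)}

    def P(statement):                  # statement: (a1, a2, a3) -> bool
        return sum(p for bits, p in atoms.items() if statement(*bits))

    def P_given(f, f2):                # P(f|f2 X) via the product rule
        return P(lambda *b: f(*b) and f2(*b)) / P(f2)

    f  = lambda a1, a2, a3: a1 or a3   # some function of the Ai
    f2 = lambda a1, a2, a3: not a2     # some condition built from the Ai
    print(P_given(f, f2))              # 0.75 for these made-up numbers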
2.3 Qualitative properties
Does implication make sense using Ps?
Yes?
Why?
Does knowing A in implication still imply B?
C = (A => B)
P(B|AC) = P(AB|C) / P(A|C)
C => AB = A
P(B|AC) = P(A|C) / P(A|C) = 1
Does knowing B is False imply A is False?
C = (A => B)
P(A|Not B and C) = P(A And Not B|C) / P(Not B|C)
C => A And Not B = False, i.e. is impossible assuming A => B
P(A|Not B and C) = P(False|C) / P(Not B|C) = 0
Does the statement
"If A => B, then (if B = True then A becomes more possible)"
work?
Yes.
Why?
C = (A => B)
P(A|BC) = P(AB|C) / P(B|C)
C => AB = A thus
P(A|BC) = P(A|C) / P(B|C)
1 <= 1 / P(B|C)
So P(A|BC) >= P(A|C)
i.e. Knowing that B is true, makes A|BC more likely.
Does the statement A => B, thus if A = False then B is less possible
work?
Yes.
Why?
(1) C = (A => B)
P(B|Not A and C) = P(Not A and B|C) / P(Not A|C) (Prod. rule)
(2) P(B|Not A and C) = P(B|C) * P(Not A|BC) / P(Not A|C)
Since ((A => B) => P(A|BC) >= P(A|C))
P(A|BC) >= P(A|C) => (Negation rule)
1 - P(Not A|BC) >= 1 - P(Not A|C) =>
P(Not A|BC) <= P(Not A|C) =>
P(Not A|BC) / P(Not A|C) <= 1 => (From dividing 2)
P(B|Not A And C) <= P(B|C)
Q.E.D.
Does the statement
(Seeing A makes B more possible, thus seeing B makes A more possible)
i.e.
(0) P(B|AC) > P(B|C) =>
P(A|BC) > P(A|C)
work?
Yes.
Why?
From prod. rule:
P(A|BC) = P(AB|C) / P(B|C) => (Prod rule)
(1) P(A|BC) = P(A|C) * P(B|AC) / P(B|C)
From given:
P(B|AC) > P(B|C) =>
(2) P(B|AC) / P(B|C) > 1
From 1 and 2
P(A|BC) > P(A|C)
Q.E.D.
Interesting memes from looking at formula 1.
If seeing A makes B only slightly more likely, then seeing B makes
A only slightly more likely.
i.e. (3) P(B|AC) = P(B|C) + eps1 => P(A|BC) = P(A|C) + eps2
Why?
Formula 1:
P(A|BC) = P(A|C) * P(B|AC) / P(B|C) => (From 3)
P(A|BC) = P(A|C) * (P(B|C) + eps1) / P(B|C) =>
P(A|BC) = P(A|C) * (1 + eps2) =>
P(A|BC) = P(A|C) + P(A|C) * eps2 =>
P(A|BC) = P(A|C) + eps3
Q.E.D.
e.g?
Eating at McDonalds makes it slightly more likely a person will
become overweight.
So knowing someone's overweight makes it slightly more likely
they eat at McDonalds, but not by a lot.
If a person is gay, it makes them slightly more likely to talk
about gay shit.
So knowing someone talks about gay shit, makes them more likely
to be gay, but not by a lot, because the possibility of someone
talking about gay shit is high anyway.
Another interesting meme is that the previous effect always
makes A more likely, never less likely, if it was possible at all.
For A to increase a lot when B is observed, it is necessary but not
sufficient for P(B|C) to be small
i.e. P(B|C) ~= 1 => P(A|BC) ~= P(A|C)
Why?
If P(B|C) ~= 1 =>
P(B|AC) / P(B|C) ~= P(B|AC)
P(B|AC) > P(B|C) =>
P(B|AC) ~= 1 =>
P(B|AC) / P(B|C) ~= P(B|AC) ~= 1 => (From 1)
P(A|BC) ~= P(A|C)
e.g.?
A person who's religious would definitely celebrate Christmas,
Yet nearly everyone celebrates Christmas, so You get nearly
no information on whether they're religious.
i.e.
Assume
A = Religious
B = Celebrates X-mas = X-mas
P(Religious|C) ~= 0.1
i.e. Default chance of being religious
P(X-mas|Religious and C) ~= 1.0
i.e. Chance of celebrating X-mas given religious
P(X-mas|C) ~= 0.99
i.e. Default chance of celebrating X-mas
Then
P(Religious|X-mas and C) = P(Religious|C) *
P(X-mas|Religious and C) / P(X-mas|C) =>
P(Religious|X-mas and C) = 0.1 * 1.0 / 0.99 ~=
= 0.1 * 1.01 ~= 0.101
Seeing someone break a window is very rare, so it gives
You a lot of information on whether the person should be
arrested.
i.e.
Assume
A = Vandalized someone's property = Vandal
B = Broke a window = Broke
P(Vandal|C) ~= 0.0001
i.e. Chance of being a Vandal.
P(Broke|Vandal and C) ~= 0.05
i.e. Chance of breaking a window given a vandal.
P(Broke|C) ~= 0.00001
i.e. Chance that someone would break a window.
Then
P(Vandal|Broke and C) = P(Vandal|C) *
P(Broke|Vandal and C) / P(Broke|C)
P(Vandal|Broke and C) = 0.0001 * 0.05 / 0.00001
= 0.0001 * 5000 = 0.5
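@A tiny script (mine) reproducing both made-up examples with
P(A|BC) = P(A|C) * P(B|AC) / P(B|C):
    def posterior(p_a, p_b_given_a, p_b):
        return p_a * p_b_given_a / p_b

    # Religious / X-mas: the evidence is common, so it barely moves A.
    print(posterior(0.1, 1.0, 0.99))          # ~0.101
    # Vandal / broke a window: the evidence is rare, so it moves A a lot.
    print(posterior(0.0001, 0.05, 0.00001))   # 0.5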