#################################################
PIC math scratchpad
#################################################

.. _crappy-pic:

-------------------------------------------------------------------
Relationship between :math:`\sigma_\cancel{B}` and :math:`\sigma_f`
-------------------------------------------------------------------

It seems like what we need first of all is some way to get the PITC posterior variance based only on :math:`\cancel{B}`, using only the full PITC posterior and calculations that scale only with :math:`|B|`. I'm not totally finished working out how :math:`(\sigma^2_{*})^{\cancel{B} \cancel{B}}` fits into the whole posterior :math:`(\sigma^2_*)^{PIC}`.

Observe that since :math:`\mathbf{\Lambda}` is a block-diagonal matrix, its inverse is block-diagonal too, so when it sits in the middle of a quadratic form, the cross terms disappear. As far as I can tell, this is the only place where we can additively separate :math:`\mathbf{Q}_{\cancel{B} \cancel{B}}` and :math:`\mathbf{K}_{B B}` without explicit cross terms. Start with the usual application of Woodbury's lemma to the PITC posterior covariance:

.. math::

   \newcommand{\Lam}{\mathbf{\Lambda}}
   \newcommand{\Kuu}{\mathbf{K}_{uu}}
   \newcommand{\Kuf}{\mathbf{K}_{uf}}
   \newcommand{\Kfu}{\mathbf{K}_{fu}}
   (\sigma_*^2) &= K_* - \mathbf{Q}_{* f} \left(\mathbf{K}_{fu} \mathbf{K}_{uu}^{-1} \mathbf{K}_{uf} + \mathbf{\Lambda} \right)^{-1} \mathbf{Q}_{f *} \\
   &= K_* - \mathbf{K}_{*u} \Kuu^{-1} \Kuf \left(\Kfu \Kuu^{-1} \Kuf + \Lam\right)^{-1} \Kfu \Kuu^{-1} \mathbf{K}_{u*} \\
   &= K_* - \mathbf{K}_{*u} \Kuu^{-1} \mathbf{K}_{u*} + \mathbf{K}_{*u} \left( \Kuu + \Kuf \Lam^{-1} \Kfu \right)^{-1} \mathbf{K}_{u*}

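A quick numerical sanity check of that Woodbury rearrangement, with random SPD stand-ins for the kernel matrices (all names, sizes, and the diagonal choice of :math:`\mathbf{\Lambda}` here are made up for illustration):

```python
import numpy as np

# Random stand-ins: n training points, m inducing points (sizes arbitrary).
rng = np.random.default_rng(0)
n, m = 12, 4

Z = rng.standard_normal((m, m))
Kuu = Z @ Z.T + m * np.eye(m)            # SPD stand-in for K_uu
Kuf = rng.standard_normal((m, n))        # stand-in for K_uf
Lam = np.diag(rng.uniform(0.5, 2.0, n))  # diagonal Lambda (special case of block-diagonal)
Ksu = rng.standard_normal((1, m))        # stand-in for K_{*u} at one test point

# Direct form of the data-dependent term: Q_{*f} (Q_{ff} + Lambda)^{-1} Q_{f*}
Qsf = Ksu @ np.linalg.solve(Kuu, Kuf)
Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)
direct = Qsf @ np.linalg.solve(Qff + Lam, Qsf.T)

# Woodbury form: K_{*u} Kuu^{-1} K_{u*} - K_{*u} (Kuu + Kuf Lam^{-1} Kfu)^{-1} K_{u*}
A = Kuu + Kuf @ np.linalg.solve(Lam, Kuf.T)
woodbury = Ksu @ np.linalg.solve(Kuu, Ksu.T) - Ksu @ np.linalg.solve(A, Ksu.T)

assert np.allclose(direct, woodbury)
```
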
Now break up :math:`\Kuf` and :math:`\Lam` into their :math:`\cancel{B}` and :math:`B` blocks:

.. math::

   \Kuf &= \begin{bmatrix} \mathbf{K}_{u \cancel{B}} & \mathbf{K}_{u B} \end{bmatrix} \\
   \Lam &= \begin{bmatrix} \Lam_{\cancel{B}} & \mathbf{0} \\ \mathbf{0} & \Lam_{B} \end{bmatrix}
so that

.. math::

   \Kuf \Lam^{-1} \Kfu &= \begin{bmatrix} \mathbf{K}_{u \cancel{B}} & \mathbf{K}_{u B} \end{bmatrix}
   \begin{bmatrix} \Lam_{\cancel{B}}^{-1} & \mathbf{0} \\ \mathbf{0} & \Lam_{B}^{-1} \end{bmatrix}
   \begin{bmatrix} \mathbf{K}_{\cancel{B} u} \\ \mathbf{K}_{B u} \end{bmatrix} \\
   &= \mathbf{K}_{u \cancel{B}} \Lam_{\cancel{B}}^{-1} \mathbf{K}_{\cancel{B} u} +
   \mathbf{K}_{u B} \Lam_{B}^{-1} \mathbf{K}_{B u} \\
   \mathbf{K}_{u \cancel{B}} \Lam_{\cancel{B}}^{-1} \mathbf{K}_{\cancel{B} u} &= \Kuf \Lam^{-1} \Kfu - \mathbf{K}_{u B} \Lam_{B}^{-1} \mathbf{K}_{B u}

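Sanity-checking the additive separation numerically (random stand-in blocks; the SPD block sizes below are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, nb, nB = 4, 9, 3   # inducing points, |not-B|, |B| (made-up sizes)
n = nb + nB

Kuf = rng.standard_normal((m, n))
Z1 = rng.standard_normal((nb, nb))
Z2 = rng.standard_normal((nB, nB))
Lam_nb = Z1 @ Z1.T + nb * np.eye(nb)   # SPD block for the points outside B
Lam_B = Z2 @ Z2.T + nB * np.eye(nB)    # SPD block for B
Lam = np.block([[Lam_nb, np.zeros((nb, nB))],
                [np.zeros((nB, nb)), Lam_B]])

Ku_nb, Ku_B = Kuf[:, :nb], Kuf[:, nb:]

full = Kuf @ np.linalg.solve(Lam, Kuf.T)
part_nb = Ku_nb @ np.linalg.solve(Lam_nb, Ku_nb.T)
part_B = Ku_B @ np.linalg.solve(Lam_B, Ku_B.T)

# Block-diagonal Lambda means no cross terms: the not-B piece is the
# full quadratic form minus the B piece.
assert np.allclose(part_nb, full - part_B)
```
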
This last line gives the term inside the inverse in the training-data-dependent part of the PITC posterior covariance using only the training points in :math:`\cancel{B}`, which I think is what we want to combine with the dense posterior from the training points in :math:`B`:

.. math::

   (\sigma_*^2)^{\cancel{B} \cancel{B}} = K_* - \mathbf{K}_{*u} \Kuu^{-1} \mathbf{K}_{u*} + \mathbf{K}_{*u} \left( \Kuu + \mathbf{K}_{u \cancel{B}} \Lam_{\cancel{B}}^{-1} \mathbf{K}_{\cancel{B} u} \right)^{-1} \mathbf{K}_{u*}
which we can write in terms of only the full PITC prior and the prior for group :math:`B`:

.. math::

   (\sigma_*^2)^{\cancel{B} \cancel{B}} = K_* - \mathbf{K}_{*u} \Kuu^{-1} \mathbf{K}_{u*} + \mathbf{K}_{*u} \left( \Kuu + \Kuf \Lam^{-1} \Kfu - \mathbf{K}_{u B} \Lam_{B}^{-1} \mathbf{K}_{B u} \right)^{-1} \mathbf{K}_{u*}
If you set :math:`B = \emptyset`, removing the correction for the training points in block :math:`B` entirely, you recover the original PITC posterior covariance based on the full set of training points :math:`f`, as expected. Now we only have to compute :math:`\Kuf \Lam^{-1} \Kfu` (the term that scales with the full training set :math:`f`) once, and nothing that scales with :math:`|\cancel{B}|`. This still involves an :math:`O(M^3)` decomposition for each group :math:`B`, and I have been assuming our group size :math:`|B|` is significantly smaller than the number of inducing points :math:`M`, so that's not great. But an interesting thing happens if we use the "downdate" form of Woodbury's lemma in the opposite direction to the usual one:

.. math::

   \left(A - B C^{-1} B^T \right)^{-1} = A^{-1} + A^{-1} B \left(C - B^T A^{-1} B\right)^{-1} B^T A^{-1}
Letting :math:`\mathbf{A} = \Kuu + \Kuf \Lam^{-1} \Kfu` (playing the part of :math:`A` in Woodbury's lemma):

.. math::

   (\sigma_*^2)^{\cancel{B} \cancel{B}} &= K_* - \mathbf{K}_{*u} \Kuu^{-1} \mathbf{K}_{u*} + \mathbf{K}_{*u} \left( \mathbf{A}^{-1} + \mathbf{A}^{-1} \mathbf{K}_{u B} \left(\Lam_{B} - \mathbf{K}_{B u} \mathbf{A}^{-1} \mathbf{K}_{u B}\right)^{-1} \mathbf{K}_{B u} \mathbf{A}^{-1} \right) \mathbf{K}_{u*} \\
   &= K_* - \mathbf{K}_{*u} \Kuu^{-1} \mathbf{K}_{u*} + \mathbf{K}_{*u} \mathbf{A}^{-1} \mathbf{K}_{u*} + \mathbf{K}_{*u} \mathbf{A}^{-1} \mathbf{K}_{u B} \left(\Lam_{B} - \mathbf{K}_{B u} \mathbf{A}^{-1} \mathbf{K}_{u B}\right)^{-1} \mathbf{K}_{B u} \mathbf{A}^{-1} \mathbf{K}_{u*}

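To convince myself the downdate route really reproduces the per-group posterior term, a small NumPy check (random stand-in matrices, diagonal :math:`\mathbf{\Lambda}` for simplicity; everything here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, nb, nB = 4, 9, 3   # made-up sizes: inducing points, |not-B|, |B|
n = nb + nB

Z = rng.standard_normal((m, m))
Kuu = Z @ Z.T + m * np.eye(m)
Kuf = rng.standard_normal((m, n))
lam = rng.uniform(0.5, 2.0, n)       # diagonal Lambda for simplicity
Ku_nb, Ku_B = Kuf[:, :nb], Kuf[:, nb:]
lam_nb, lam_B = lam[:nb], lam[nb:]
Ksu = rng.standard_normal((1, m))

# Direct route: build (Kuu + K_{u notB} Lam_notB^{-1} K_{notB u}) per group.
M_nb = Kuu + Ku_nb @ (Ku_nb / lam_nb).T
direct = Ksu @ np.linalg.solve(M_nb, Ksu.T)

# Downdate route: form A = Kuu + Kuf Lam^{-1} Kfu once; the only
# per-group inverse is the |B| x |B| matrix S below.
A = Kuu + Kuf @ (Kuf / lam).T
Ainv_Ks = np.linalg.solve(A, Ksu.T)
Ainv_KuB = np.linalg.solve(A, Ku_B)
S = np.diag(lam_B) - Ku_B.T @ Ainv_KuB
w = Ku_B.T @ Ainv_Ks                 # K_{Bu} A^{-1} K_{u*}
downdate = Ksu @ Ainv_Ks + w.T @ np.linalg.solve(S, w)

assert np.allclose(direct, downdate)
```
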
I find the use of Woodbury's lemma to downdate a bit uncomfortable -- does it work for real symmetric matrices? But we would only need to decompose :math:`\mathbf{A}` (and also :math:`\Kuu`) at a cost of :math:`O(M^3)` once; the inverse to be calculated per group costs :math:`O(|B|^3)`. I also don't immediately spot a slick QR-decomposition way to compute this per-group term; it's pretty symmetrical. The only thing that jumps out is precomputing :math:`\mathbf{K}_{B u} \mathbf{A}^{-\frac{1}{2}}`.

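A sketch of that precompute idea: factor :math:`\mathbf{A} = L L^T` once with a Cholesky decomposition, so :math:`\mathbf{K}_{B u} \mathbf{A}^{-1} \mathbf{K}_{u B} = W^T W` with :math:`W = L^{-1} \mathbf{K}_{u B}` (NumPy stand-ins; `np.linalg.solve` stands in for a proper triangular solve):

```python
import numpy as np

rng = np.random.default_rng(3)
m, nB = 4, 3
Z = rng.standard_normal((m, m))
A = Z @ Z.T + m * np.eye(m)     # stand-in for Kuu + Kuf Lam^{-1} Kfu
Ku_B = rng.standard_normal((m, nB))
lam_B = rng.uniform(0.5, 2.0, nB)

# Precompute once: A = L L^T, so A^{-1} = L^{-T} L^{-1}.
L = np.linalg.cholesky(A)

# Per group: W = L^{-1} K_{uB} (a triangular solve in practice), then
# K_{Bu} A^{-1} K_{uB} = W^T W, giving the |B| x |B| inner matrix.
W = np.linalg.solve(L, Ku_B)
inner = np.diag(lam_B) - W.T @ W

assert np.allclose(inner, np.diag(lam_B) - Ku_B.T @ np.linalg.solve(A, Ku_B))
```
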
----------------------------------------------
Derivation of Woodbury's Lemma for Differences
----------------------------------------------

This section is specifically for symmetric matrices, to convince myself the downdate formula is legit. We would like to compute a rank-:math:`k` downdate to the inverse of a matrix :math:`A`:

.. math::

   \left(A - B C^{-1} B^T \right)^{-1} = A^{-1} + A^{-1} B \left(C - B^T A^{-1} B\right)^{-1} B^T A^{-1}

To be extra clear, let's say :math:`A \in \mathbb{R}^{n \times n}`, :math:`B \in \mathbb{R}^{n \times k}`, and :math:`C \in \mathbb{R}^{k \times k}`. We put the relevant matrices into a block matrix :math:`M \in \mathbb{R}^{m \times m}` for :math:`m = n + k`, and label the conforming blocks of its inverse:

.. math::

   M &= \begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \\
   M^{-1} &= \begin{pmatrix} W & X \\ X^T & Y \end{pmatrix} \\
   M M^{-1} &= \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{I} \end{pmatrix} \\
   M^{-1} M &= \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}

The blocks of :math:`M M^{-1}` yield the following equations:

.. math::

   AW + BX^T &= \mathbf{I} \\
   AX + BY &= \mathbf{0} \\
   B^T W + CX^T &= \mathbf{0} \\
   B^T X + CY &= \mathbf{I}

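These four block equations are easy to check numerically for a random symmetric :math:`M` (sizes are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 5, 2
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
B = rng.standard_normal((n, k))
C = rng.standard_normal((k, k)); C = C @ C.T + k * np.eye(k)

M = np.block([[A, B], [B.T, C]])
Minv = np.linalg.inv(M)
W, X = Minv[:n, :n], Minv[:n, n:]   # M symmetric, so Minv is too
Y = Minv[n:, n:]

# The four block equations from M @ Minv = I:
assert np.allclose(A @ W + B @ X.T, np.eye(n))
assert np.allclose(A @ X + B @ Y, 0)
assert np.allclose(B.T @ W + C @ X.T, 0)
assert np.allclose(B.T @ X + C @ Y, np.eye(k))
```
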
Rearrange the middle two:

.. math::

   X &= -A^{-1} B Y \\
   X^T &= -C^{-1} B^T W
If we do the same for :math:`M^{-1} M`:

.. math::

   X &= -W B C^{-1} \\
   X^T &= -Y B^T A^{-1}
These blocks are equal, so:

.. math::

   C^{-1} B^T W &= Y B^T A^{-1} \\
   A^{-1} B Y &= W B C^{-1}
Now take the rearranged middle two equations from :math:`M M^{-1}` and plug them into the first and last equations, respectively:

.. math::

   W - A^{-1} B C^{-1} B^T W &= A^{-1} \\
   \left(I - A^{-1} B C^{-1} B^T\right) W &= A^{-1} \\
   Y - C^{-1} B^T A^{-1} B Y &= C^{-1} \\
   \left(I - C^{-1} B^T A^{-1} B\right) Y &= C^{-1}

Assuming :math:`\left(I - A^{-1} B C^{-1} B^T\right)` and :math:`\left(I - C^{-1} B^T A^{-1} B\right)` are invertible, rearrange:

.. math::

   W &= \left(I - A^{-1} B C^{-1} B^T\right)^{-1} A^{-1} \\
   &= \left(A - B C^{-1} B^T\right)^{-1} \\
   Y &= \left(I - C^{-1} B^T A^{-1} B\right)^{-1} C^{-1} \\
   &= \left(C - B^T A^{-1} B\right)^{-1}

Now use the equality of the off-diagonal blocks from the two ways of multiplying :math:`M` and :math:`M^{-1}` above (don't you wish Sphinx would number equations?) and substitute:

.. math::

   C^{-1} B^T \left(A - B C^{-1} B^T\right)^{-1} &= \left(C - B^T A^{-1} B\right)^{-1} B^T A^{-1} \\
   \left(A - B C^{-1} B^T\right)^{-1} B C^{-1} &= A^{-1} B \left(C - B^T A^{-1} B\right)^{-1}
Let's look just at the term :math:`\left(A - B C^{-1} B^T\right)` and right-multiply it by :math:`-A^{-1}`:

.. math::

   \left(A - B C^{-1} B^T\right) \left(-A^{-1}\right) &= -\mathbf{I} + B C^{-1} B^T A^{-1} \\
   \mathbf{I} + \left(A - B C^{-1} B^T\right) \left(-A^{-1}\right) &= B C^{-1} B^T A^{-1}
Now let's return to the previous equation and right-multiply by :math:`B^T A^{-1}`:

.. math::

   \left(A - B C^{-1} B^T\right)^{-1} B C^{-1} B^T A^{-1} = A^{-1} B \left(C - B^T A^{-1} B\right)^{-1} B^T A^{-1}
Substitute the previous result for :math:`B C^{-1} B^T A^{-1}`:

.. math::

   \left(A - B C^{-1} B^T\right)^{-1} \left( \mathbf{I} + \left(A - B C^{-1} B^T\right) \left(-A^{-1}\right) \right) &= A^{-1} B \left(C - B^T A^{-1} B\right)^{-1} B^T A^{-1} \\
   \left(A - B C^{-1} B^T\right)^{-1} - A^{-1} &= A^{-1} B \left(C - B^T A^{-1} B\right)^{-1} B^T A^{-1} \\
   \left(A - B C^{-1} B^T \right)^{-1} &= A^{-1} + A^{-1} B \left(C - B^T A^{-1} B\right)^{-1} B^T A^{-1} \quad \blacksquare
which is the difference form of Woodbury's lemma, or a rank-:math:`k` downdate of :math:`A^{-1}`. The main assumption seems to be that :math:`\left(I - A^{-1} B C^{-1} B^T\right)` and :math:`\left(I - C^{-1} B^T A^{-1} B\right)` are invertible.

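And a numerical check that the derived identity does hold for real symmetric matrices (random positive-definite stand-ins, with :math:`A` bumped up to keep the downdated matrix comfortably invertible):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 2
B = rng.standard_normal((n, k))
Z = rng.standard_normal((n, n))
A = Z @ Z.T + 20 * np.eye(n)   # symmetric PD, well away from singular
Zc = rng.standard_normal((k, k))
C = Zc @ Zc.T + k * np.eye(k)  # symmetric PD

# Left side: invert the rank-k downdated matrix directly.
lhs = np.linalg.inv(A - B @ np.linalg.solve(C, B.T))

# Right side: the downdate formula, touching only A^{-1} and a k x k inverse.
Ainv = np.linalg.inv(A)
rhs = Ainv + Ainv @ B @ np.linalg.solve(C - B.T @ Ainv @ B, B.T @ Ainv)

assert np.allclose(lhs, rhs)
```
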
------------------------
Blockwise inversion of S
------------------------

I tried to invert :math:`\mathbf{S}` blockwise; it didn't work out, but I'm leaving it here just to look back at:

.. math::

   \newcommand{\VV}{\mathbf{V}}
   \mathbf{U} &= \mathbf{Q}_{* \cancel{B}} \mathbf{S}^{-1}_{\cancel{B} B} \VV_{B *} \\
   &= \mathbf{Q}_{* \cancel{B}} \left( \mathbf{Q}_{\cancel{B} \cancel{B}}^{-1} \mathbf{Q}_{\cancel{B} B} \left(\mathbf{Q}_{B B} - \mathbf{Q}_{B \cancel{B}} \mathbf{Q}_{\cancel{B} \cancel{B}}^{-1} \mathbf{Q}_{\cancel{B} B}\right)^{-1} \right) \VV_{B *}

How are we going to avoid the :math:`O(|\cancel{B}|^3)` inversion for :math:`\mathbf{Q}_{\cancel{B} \cancel{B}}`? Also, we are going to get cross terms involving both :math:`\mathbf{Q}_{* \cancel{B}}` and :math:`\mathbf{Q}_{B B}`, but maybe those are OK because they are only :math:`O(|\cancel{B}| |B|)`?