<h1>Inbreeding</h1>
<p><em>Martynas Miliauskas · 2023-06-27</em></p>
<p>90% of the time I code these days, I use GitHub Copilot. I have premium ChatGPT for some general shit-throwing at the wall. I also pay for their API for a few scripts that summarize 1-2 hour long YouTube videos for me. LLMs and generative AI are proliferating fast; more and more I find myself reading articles that are written by LLMs and have their illustrations generated by diffusion models. For example, an article like <a href="https://haroonhaider.wordpress.com/2023/01/26/inflows-into-global-money-market-funds-reach-record-highs-amidst-economic-uncertainty/">this</a> greets you with the following aesthetics:</p>
<p><img src="https://haroonhaider.files.wordpress.com/2023/01/dhfuguh_super_rich_peoplestock_marketmoneypatienceanalysis_f34e8d2a-8cf2-4af3-9232-c8c1399fa650.jpg" alt="" /></p>
<p>The economics of what is going on right now are clear: we, humans, like cheap shit. For a while China and India were the sources of that. Over a few decades, markets all around the world got quickly saturated with cheap crap made there. Almost overnight, artisans around the world would find themselves debating whether to buy a Makita drill that costs say $100, or a cheap Chinese version that looks almost identical and has the same specs but sells for say $50. It doesn’t matter that the damn thing breaks in like a week; everyone bought it and continues to buy. If China was the cheap shit source for goods manufacturing, India was the cheap shit source for services: whether you needed a cheap website for your headhunting business or an R&D project delivered “cheaper” than your in-house team could, you would go to some Indian “company”. The guys there would do everything, “very very cheap and very very good”. For the business and management people who never wrote a line of code in their life, it made sense. For the programmers who would always inherit the pile of shit from the East, the realization usually was that “the salary is not the most important thing in life”, and they would just jump ship.</p>
<p>Fast forward to the 2020s, and our services- and content-driven economy just got a Greek gift – “\(\text{very very cheap and very very good}^{\text{very very cheap and very very good}}\)”. What we have right now is a medium (the Internet), where “things” spread at the speed of light (no need to wait for cargo ships, slow trains, etc.), and a tool (generative AI) that in seconds produces something that would otherwise take at least a couple of hours to make, as long as you stay ignorant about quality. Thus the new wave of mega cheap shit is moving without barriers and its potential size is exponential. Basically, the wave gets surprisingly bigger every time we blink.</p>
<p>Clearly, I use LLMs myself, yet there is this clear disdain, so what is the message here? The good days won’t last. Generative AI has a version of the Midas touch: everything it touches turns to shit. It is not an existential problem until our Midas has to eat. One problem that I see discussed nowhere is this: given the economics of GAIs, how long will it take for training sets to become overrepresented by content generated by GAIs themselves? I bet that if we were to count all the images on the internet right now, the share of images where humans have 3 or 6 fingers went from ~0% to 10% over the last year alone. Obviously, that is a joke, but it helps to visualize the problem. If I were to make a serious guess: “data inbreeding” will become a hot topic in the future, the quality of GAI models is close to a peak, and pre-GAI datasets will be worth gold.</p>
<h1>Making Sense of Gödel’s Incompleteness Theorem</h1>
<p><em>2023-04-30</em></p>
<p>My first encounter with Gödel’s Incompleteness Theorem was when I was reading “Gödel, Escher, Bach: an Eternal Golden Braid” by Douglas Hofstadter. On page 18 he immediately hits you with the following statement:</p>
<blockquote>
<p>This statement of number theory does not have any proof in the system of <em>Principia Mathematica</em>.</p>
</blockquote>
<p>Apparently, Gödel encoded this sentence into a valid mathematical statement of <a href="https://en.wikipedia.org/wiki/Principia_Mathematica">Principia Mathematica</a> (PM) and thus proved that PM is incomplete. What’s also interesting is that PM was designed in such a way that self-reference could not be expressed in its language —</p>
<p><img src="/assets/img/making_sense_of_godels_incompleteness_theorem/huh.gif" width="400" /></p>
<p>First of all, how can you express self-reference using mathematical notation? Second, how can you express that something can or can’t be proved by some system mathematically? Finally, how can you do it in a system that forbids self-reference to begin with?</p>
<p>Personally, I am at the level of a mathematician who can only handle “4 is even” statements. That’s easy 😂:</p>
\[\exists e:(SS0*e)=SSSS0\]
<p>But the one above was too meta.</p>
<h2 id="mathematics-is-a-science-of-definitions">Mathematics is a Science of Definitions</h2>
<p>I think the really important thing to internalize about mathematics is that formally as a system it only deals with definitions. A mathematical system (number theory, logic, set theory, etc) always starts with a list of initially defined truths (<em>axioms</em>) and some initially defined <em>rules</em> that let you create new definitions or truths (<em>theorems</em>). Thus you use definitions to create definitions.</p>
<p>The scientific part comes when somebody gives you some strange but correctly expressed definition in a system like number theory (NT); if a chain of rule applications, starting from some of the axioms, can be found that leads to the provided statement, then that statement passes the test of truth.</p>
<h2 id="valid-definitions-in-pm">Valid Definitions in PM</h2>
<p>The condition that PM had for its definitions was that they could not be <em>recursively</em> defined. If you are a programmer, a simple rule to tell whether a definition is allowed in PM or not, is to ask yourself if it can be implemented without while loops or recursion.</p>
<p>For instance, say you have two definitions <code class="language-plaintext highlighter-rouge">SUB(a,a,262)</code> and <code class="language-plaintext highlighter-rouge">PROOF(x,z)</code>. The <code class="language-plaintext highlighter-rouge">SUB(a,a,262)</code> definition says that there is a number which results from every sequence of digits ‘262’ in $a$ being replaced by the number $a$ itself. You can implement this with a for-loop of length size(a)/3. The <code class="language-plaintext highlighter-rouge">PROOF(x,z)</code> definition says that $x$ encodes a path from the axioms to $z$, where every step comes from applying one of the defined rules. The number of possible axioms, rules, and steps in the path are all known ahead of the check, so this again can be implemented using multiple for-loops.</p>
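<p>A toy sketch of this “bounded loops only” idea in Python (my own illustration, not Gödel’s actual definitions; it works on digit strings for readability, and the single bounded scan done by <code>str.replace</code> plays the role of the for-loop):</p>

```python
def sub(n: int, pattern: int = 262) -> int:
    """SUB(n, n, pattern): replace every occurrence of pattern's digit
    sequence inside n with the digits of n itself. Only one bounded
    pass over the digits is needed -- no unbounded search."""
    digits = str(n)
    return int(digits.replace(str(pattern), digits))

print(sub(262111666))  # 262111666111666
```

<p>A check like <code>PROOF(x,z)</code> is similarly bounded, since the length of the candidate path $x$ is known up front.</p>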
<p>There are also invalid definitions, like <code class="language-plaintext highlighter-rouge">PROOF(z)</code>. It is invalid because it is impossible to know ahead of time how many times the rules will have to be applied. If $z$ is not a theorem, the search might never terminate. You can’t implement this without recursion or an unbounded loop.</p>
<h2 id="completeness-and-consistency">Completeness and Consistency</h2>
<p><em>Completeness</em> and <em>consistency</em> are themselves definitions, or rather meta definitions, since they are definitions about formal systems that deal with definitions. They are as follows:</p>
<ul>
<li>Completeness — all statements that are true in the real (or an imaginable) world should also be provable within the system.</li>
<li>Consistency — both G and ~G can’t be theorems, e.g. “0=0” and “~(0=0)” can’t both be true.</li>
</ul>
<p>In a nutshell, completeness means “Can I express all the truths that I see in the world?”, whilst consistency means “There are no logic bombs amongst those truths.”</p>
<h2 id="interpretations-of-definitions">Interpretations of Definitions</h2>
<p>In my opinion, the most important step in understanding the theorem is to realize that definitions have multiple valid interpretations, even in mathematics. To use a concrete example, say you and your friends have a special number code that you use in your math class. It could look something like this:</p>
<ul>
<li>0 → 666</li>
<li>S → 123</li>
<li>= → 111</li>
<li>…</li>
<li>a → 262</li>
<li>‘ → 163</li>
</ul>
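<p>A minimal sketch of this codebook in Python (the five codes are the ones listed above; everything else about the snippet is my own illustration):</p>

```python
# Partial codebook from the list above.
CODE = {"0": "666", "S": "123", "=": "111", "a": "262", "'": "163"}

def encode(formula: str) -> str:
    """Encode a formula symbol by symbol, joining the codes with commas."""
    return ",".join(CODE[symbol] for symbol in formula)

print(encode("a=0"))  # 262,111,666
```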
<p>One day your math teacher writes “a=0” on a blackboard and asks you to write the correct number in place of the symbol “a”. The number that you write is “262,111,666”, so the final result looks like this: </p>
\[262,111,666=0\]
<p>As your math teacher is contemplating his existence, your friends can see what just happened. Using your secret interpretation they can see that the statement actually reads:</p>
\[(a=0)=0\]
<p>which in itself can also be interpreted as saying “The result of 262,111,666=0 is equal to 0”, which in turn — “<strong>I</strong> am equal to 0”.</p>
<p>Thus, from your math teacher’s point of view you wrote some nonsensical answer, namely that some large number is equal to 0. However, from your friends’ point of view you wrote a self-referential statement. We can call the teacher’s interpretation literal or <em>mechanical</em>, and your and your friends’ interpretation “between the lines” or <em>intelligent</em>. In theory, computers are only capable of the mechanical interpretation, whilst we humans can do both.</p>
<p>And this is how Gödel broke PM.</p>
<h2 id="gödels-code">Gödel’s Code</h2>
<p>If you think about it, the application of definitions in mathematics is just symbol shunting. All you really do is add and remove symbols in some predefined manner. That’s what multiplication, addition and division do too. Thus, to break PM, Gödel devised a codebook for PM’s different symbols, converting all the axioms into numbers and all the rules into arithmetical operations. All of a sudden, you could write down an NT statement that looked like some innocent addition and multiplication; NT would think it is just doing arithmetic, whilst in reality it would be talking about itself.</p>
<h2 id="encoding-self-reference-in-pm">Encoding Self-Reference in PM</h2>
<p>First, let’s do a recap. Previously we saw a valid definition of PM called <code class="language-plaintext highlighter-rouge">SUB(a, a, 262)</code>. Just to have some useful visual before we continue, let’s initialize a with some random number and see what we get:</p>
\[\begin{align*}
a&=262,111,666 \\
SUB(a,a,262) &\Rightarrow \textbf{262,111,666},111,666
\end{align*}\]
<p>Cool. Let’s start with a sentence, call it the <em>uncle</em>:</p>
<blockquote>
<p>The formula with Gödel’s number SUB(a,a,262) cannot be proved in PM.</p>
</blockquote>
<p>This can be expressed in the PM’s notation as follows:</p>
\[\sim(\exists x):\text{PROOF}(x, \text{SUB}(a,a,262))\]
<p>Note: <code class="language-plaintext highlighter-rouge">SUB(a,a,262)</code> and the Gödel number of <code class="language-plaintext highlighter-rouge">SUB(a,a,262)</code> can be used interchangeably, since both are the same thing expressed in a different notation.</p>
<p>In the formal definition above, $a$ is left undefined; it can be any Gödel number. The definition itself can be converted into a Gödel number, which we can call u. We can use u to write another sentence, call it <em>G</em> for “Gödel’s sentence”:</p>
\[G=\text{SUB}(u, u, 262)\]
<p>The substitution operation on the number u results in a new number whose English interpretation has the symbol a replaced by u, thus resulting in the following statement:</p>
<blockquote>
<p>The formula with Gödel’s number SUB(u,u,262) cannot be proved in PM.</p>
</blockquote>
<p>Again, we can convert this statement into Gödel’s number, call it g. What is that number? Well, it is <code class="language-plaintext highlighter-rouge">SUB(u,u,262)</code> expressed as Gödel’s number. It is a bit loopy, but if you stare at it long enough, after a while the statement “says”</p>
<center>"I can't be proved in PM"</center>
<h2 id="incompleteness-and-undecidability">Incompleteness and Undecidability</h2>
<p>Ok, so is G a theorem? If yes, then it can’t be proven, thus ~G. That’s inconsistent. If it is ~G, it means that G can be proved, thus G. If it is G, then it means… well, you get the point: it is undecidable within PM.</p>
<p>However, we know that the statement is true in reality (outside the system), because that’s exactly what it is trying to tell us, that it can’t be proven! So PM is incomplete — there exists a truth in the real world that doesn’t “exist” in PM.</p>
<h2 id="in-retrospect">In Retrospect</h2>
<p>Once the proof “clicks”, it starts to make sense why Hofstadter chose this theorem as the centerpiece of the masterpiece that GEB is. It also makes sense why he spent so much time writing and discussing it further afterwards. If you stare long enough into the proof, the proof stares back at you.</p>
<p>In the book itself Hofstadter discusses how this theorem is like one of those optical illusions:</p>
<p><img src="/assets/img/making_sense_of_godels_incompleteness_theorem/Spinning_Dancer.gif" width="300" /></p>
<p>If you stare long enough, the rotation direction switches: it goes from clockwise to counter-clockwise, then back again. Similarly with the proof of Gödel’s theorem, if you keep <em>looping</em> back and forth between different levels of interpretation, you can actually start to see how “I” <strong>emerges</strong>. Thus the theorem itself can be used as a potential explanation of how the conscious “I” can appear seemingly out of dead/inanimate space, no matter whether it is some biological mush or an electronic circuit.</p>
<h1>How Good are Earnings Forecasts?</h1>
<p><em>2022-12-06</em></p>
<p>Earnings forecasts made by analysts play a significant role in the financial world, as they provide investors and stakeholders with information about a company’s expected performance. These forecasts are often used to inform investment decisions and can have a significant impact on a company’s stock price. Almost all stock valuation models incorporate them. However, nobody has a crystal ball, so we turn to the next best thing - oracles, aka analysts.</p>
<p>Since these forecasts play such an important part, I wonder how accurate they tend to be. The following is a meta-study of various scientific papers that have explored this question.</p>
<h2 id="bullish-bias">Bullish bias</h2>
<p>We can start with De Bondt et al. (1990), who took yearly forecasts from 1976-1984. They note that actual earnings were 65% of what was predicted at the start of the year, dropping to 46% for the 2-year horizon. Loh et al. (2003) similarly found that actual earnings represented only 64% of the one-year forecasts in their study of earnings forecasts during the Asian crisis. They also split forecasts into pre- and post-crisis and ran a regression (\(AC_t = \beta FC_t + \epsilon_t\)) to see how much of the actual earnings change was captured by the forecasted change. The resulting \(\beta\) was 0.98 during normal times and 0.028 during the crisis. So the <strong>analysts got the actual change on the money during normal times, and completely missed during the crisis period</strong>. Loh et al. (2003) also showed that during the analyzed period from 1990 to 1999, yearly forecasts were positively biased in 7 out of 10 years, mostly the crisis years:</p>
<p><img src="/assets/img/how_accurate_earnings_forecasts/forecast_error_by_year_loh_2013.png" width="400" /></p>
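<p>For the curious, the no-intercept regression \(AC_t = \beta FC_t + \epsilon_t\) has a one-line closed-form slope. Here is a sketch on made-up forecast/actual pairs (the numbers are illustrative, not from the paper):</p>

```python
# Forecasted (FC) and actual (AC) earnings changes; made-up numbers.
fc = [0.10, 0.05, -0.02, 0.08]
ac = [0.09, 0.06, -0.01, 0.07]

# OLS slope through the origin: beta = sum(fc*ac) / sum(fc^2)
beta = sum(f * a for f, a in zip(fc, ac)) / sum(f * f for f in fc)
print(round(beta, 2))  # a beta near 1 means forecasted changes track actual ones
```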
<p>We saw how forecasters did during the Asian crisis, but maybe things improved afterwards? Hutira (2016) analyzed earnings forecasts for the U.S. based companies for the 2000-2013 period using various forecasting horizons. For every horizon he plotted an average forecasting error (see the chart below). Again, we can observe that during the crisis errors spike and have to be revised downwards the most:</p>
<p><img src="/assets/img/how_accurate_earnings_forecasts/average_forecast_error_by_horizon_hutira_2016.png" width="500" /></p>
<p>Sidhu et al. (2011) studied analysts’ forecasts from the dot-com bubble bottom in 2003 all the way to the start of the 2008 crash. What they found was that, for both quarterly and yearly forecasts, analysts grew increasingly bullish as the bubble grew. The chart below shows how the buy signal grew and the sell signal shrank even as the bubble started to pop.</p>
<p><img src="/assets/img/how_accurate_earnings_forecasts/buy_sell_recommendations_sidhu_2011.png" width="400" /></p>
<p>The asymmetry of buy and sell signal proportions is also noteworthy. The percentage of analysts suggesting to “buy” was in the 40-60% range throughout the bull run, whilst the “sell” camp stayed around 10%.</p>
<h2 id="no-better-than-weigthed-past-episodes">No better than weighted past episodes</h2>
<p>It is my long-time suspicion that, quantitatively, the best tool humanity has for predicting the future is a linear combination of past episodes. In addition, Daniel Kahneman and Amos Tversky (1973) found that people tend to overweight the most recent episode and underweight long-term averages.</p>
<p>Whilst it is true that for short-term horizons (&lt; quarter) analysts tend to be more accurate than simple time-series models, the results become much more interesting as we test larger horizons. For instance, Bradshaw et al. (2012) compared a simple random walk model (\(E_T(\text{EPS}_{T+\tau}) = \text{EPS}_T,\ \tau \in \{1,2,3\}\)) for 1, 2 and 3 year horizons with analysts’ forecasts. They found that for all three time horizons the absolute error is identical. The only thing that differs is the bias: analysts are over-optimistic, the random walk model over-pessimistic:</p>
<p><img src="/assets/img/how_accurate_earnings_forecasts/descriptive_statistics_bradshaw_2012.png" width="500" /></p>
<p>So a simple lag function is just as good as professional analysts once we expand the horizon to the one-year mark and beyond.</p>
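<p>The random walk benchmark is trivial to reproduce. Here is a sketch on a made-up annual EPS series (illustrative numbers only; a real test would use analyst and earnings data from a provider):</p>

```python
# Made-up annual EPS series.
eps = [1.00, 1.10, 1.05, 1.20, 1.35, 1.25, 1.40]

def random_walk_abs_errors(series, tau):
    """Absolute errors of the random walk forecast E_T(EPS_{T+tau}) = EPS_T."""
    return [abs(series[t + tau] - series[t])
            for t in range(len(series) - tau)]

for tau in (1, 2, 3):
    errors = random_walk_abs_errors(eps, tau)
    print(tau, round(sum(errors) / len(errors), 3))
```

<p>Comparing this mean absolute error against that of analyst forecasts at each horizon is essentially the Bradshaw et al. (2012) exercise.</p>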
<p>It seems that for yearly and longer horizons we can live without the analysts, but what about shorter horizons, say a quarter? Lorek et al. (2014) found that an <a href="https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average">ARIMA</a> model makes forecasts that are just as or more accurate than those made by analysts <strong>39.4%</strong> of the time. That share increases as one goes from one to three quarters.</p>
<h2 id="conclusion">Conclusion</h2>
<p>To summarize, it seems that analysts cannot beat simple time-series models for time horizons larger than a quarter. Their half-year and yearly forecasts tend to have the largest errors during market crashes. However, as depicted by Hutira (2016), forecasts with horizons shorter than a quarter tend to be quite accurate.</p>
<h2 id="reference">Reference</h2>
<ul>
<li>De Bondt, W. F., & Thaler, R. H. (1990). Do security analysts overreact?. The American Economic Review, 52-57.</li>
<li>Loh, R. K., & Mian, M. (2003). The quality of analysts’ earnings forecasts during the Asian crisis: evidence from Singapore. Journal of Business Finance & Accounting, 30(5‐6), 749-770.</li>
<li>Sidhu, B., & Tan, H. C. (2011). The performance of equity analysts during the global financial crisis. Australian Accounting Review, 21(1), 32-43.</li>
<li>Lorek, S. K., & Pagach, D. P. (2014). Analysts versus time-series forecasts of quarterly earnings: A maintained hypothesis revisited. Available at SSRN 2406013.</li>
<li>Bradshaw, M. T., Drake, M. S., Myers, J. N., & Myers, L. A. (2012). A re-examination of analysts’ superiority over time-series forecasts of annual earnings. Review of Accounting Studies, 17(4), 944-968.</li>
<li>Hutira, S. (2016). Determinants of analyst forecasting accuracy.</li>
<li>Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80(4), 237.</li>
</ul>
<h1>The Uncertain Side of Econometrics</h1>
<p><em>2022-06-20</em></p>
<p>Financial media, central bankers and even most economists talk about various economic parameter estimates as if they were certainties. For instance, you always hear things like “the last quarter GDP growth was at 1.2%”, “inflation is the highest since 1985”, etc. In this post I look into the different types of uncertainty that come with economic estimates, namely GDP growth, the unemployment rate and CPI.</p>
<h2 id="gdp-growth">GDP growth</h2>
<p>The GDP growth number reported in financial media at the start of a quarter is released by the Bureau of Economic Analysis (BEA) and is just a first estimate. It is made with about 25% of the data still missing (since the data for the service sector is not yet available); the missing data is extrapolated from past trends. As the data continues to arrive, the estimates are updated and reported as second and third estimates two and three months after the initial release. It is also possible to see the most recent estimates as of today. Fixler et al. (2011) report the mean absolute revision (MAR) for annualized quarterly estimates to be 1.31% (advance), 1.29% (second) and 1.32% (third). When it comes to extreme cases, Òscar Jordà et al. (2020) note that:</p>
<blockquote>
<p>…fourth quarter of 2008 … initially listed at -3.8% annual rate … revised down … to an eye-watering -8.4% annual rate … had officials known the actual depth of the recession in real time, they might have voted for a larger fiscal stimulus package, for example.</p>
</blockquote>
<p>They also note that:</p>
<blockquote>
<p>In general, the larger the advance estimate, the smaller the revision. However, the relationship is asymmetric: negative growth rates are, on average, followed by larger revisions, in absolute value.</p>
</blockquote>
<p>The <a href="https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/first-second-third">data</a> is provided by the Federal Reserve Bank of Philadelphia, so we can have a look at how accurate these estimates are ourselves. For instance, what is the distribution of absolute revisions when a negative first estimate is revised down even further in the most recent estimate?</p>
<p><img src="/assets/img/uncertain_econometrics/mars_for_negative_first_and_most_recent.png" alt="absolute revisions for negative first and most recent" /></p>
<p>Most frequently a negative first estimate is eventually revised down by 1-1.5%. In the most extreme case it was revised down by 4.7% in 2008.</p>
<p>Another interesting question to explore is how many cases there are when the initial estimate was positive (economy was growing) whilst the most recent one is negative (it was actually shrinking)?</p>
<table>
<thead>
<tr>
<th style="text-align: right">Date</th>
<th style="text-align: right">First</th>
<th style="text-align: right">Second</th>
<th style="text-align: right">Third</th>
<th style="text-align: right">Most recent</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1973-07-01</td>
<td style="text-align: right">3.546</td>
<td style="text-align: right">3.399</td>
<td style="text-align: right">3.399</td>
<td style="text-align: right">-2.093</td>
</tr>
<tr>
<td style="text-align: right">1980-07-01</td>
<td style="text-align: right">0.998</td>
<td style="text-align: right">0.883</td>
<td style="text-align: right">2.372</td>
<td style="text-align: right">-0.471</td>
</tr>
<tr>
<td style="text-align: right">1982-07-01</td>
<td style="text-align: right">0.760</td>
<td style="text-align: right">0.000</td>
<td style="text-align: right">0.733</td>
<td style="text-align: right">-1.526</td>
</tr>
<tr>
<td style="text-align: right">2001-01-01</td>
<td style="text-align: right">1.982</td>
<td style="text-align: right">1.318</td>
<td style="text-align: right">1.245</td>
<td style="text-align: right">-1.291</td>
</tr>
<tr>
<td style="text-align: right">2008-01-01</td>
<td style="text-align: right">0.597</td>
<td style="text-align: right">0.901</td>
<td style="text-align: right">0.959</td>
<td style="text-align: right">-1.619</td>
</tr>
<tr>
<td style="text-align: right">2011-01-01</td>
<td style="text-align: right">1.748</td>
<td style="text-align: right">1.842</td>
<td style="text-align: right">1.915</td>
<td style="text-align: right">-0.961</td>
</tr>
<tr>
<td style="text-align: right">2011-07-01</td>
<td style="text-align: right">2.464</td>
<td style="text-align: right">2.004</td>
<td style="text-align: right">1.815</td>
<td style="text-align: right">-0.154</td>
</tr>
<tr>
<td style="text-align: right">2014-01-01</td>
<td style="text-align: right">0.108</td>
<td style="text-align: right"><strong>-0.985</strong></td>
<td style="text-align: right"><strong>-2.930</strong></td>
<td style="text-align: right">-1.395</td>
</tr>
</tbody>
</table>
<p>The table above shows that the second and third revisions managed to catch the actual negative growth only once. The dates are also noteworthy: there were recessions in 1973-1975, 1980, 1982, 2001 and 2008, and 2011 saw the European debt crisis.</p>
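<p>The sign-flip check behind the table is a one-liner. Here is a sketch using three of the rows above (values as quoted from the Philadelphia Fed data; the tuple layout is my own):</p>

```python
# (date, first estimate, most recent estimate), annualized quarterly growth in %.
rows = [
    ("1973-07-01", 3.546, -2.093),
    ("2008-01-01", 0.597, -1.619),
    ("2014-01-01", 0.108, -1.395),
]

# Quarters initially reported as growth that turned out to be contractions.
flips = [date for date, first, recent in rows if first > 0 > recent]
print(flips)  # all three rows flip sign
```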
<p>To conclude, we can clearly see that there is uncertainty attached not just to the first, but also to the second and third GDP growth estimates. We can also see that the estimates are at their worst when it matters the most. Manski (2014) calls this type of uncertainty, caused by the time it takes to collect the data, <em>transitory statistical uncertainty</em>.</p>
<h2 id="unemployment-rate">Unemployment rate</h2>
<p>Another source of error is the so-called <em>permanent statistical uncertainty</em> that is created by incomplete, missing or wrong information in the collected data. The unemployment rate is impacted by this because it is measured using the responses of the Current Population Survey (CPS), where a sample of households answers various questions, including about their employment situation. The CPS surveys households for four consecutive months, removes them from the sampling pool for eight months, then surveys them again for four consecutive months. Households can choose not to respond in any of the surveyed months. When there is no response to some or all of the questions, that data is permanently lost. In order to fill in these gaps, the Census Bureau does something called <em>hot-deck</em> imputation. Basically, they extrapolate the missing responses from the responses of “similar” (based on age, race, sex, etc.) individuals in the sampling pool. It is not known how closely these imputations mimic the population. A document by the Census Bureau (2011) notes this:</p>
<blockquote>
<p>Some people refuse the interview or do not know the answers. When the entire interview is missing, other similar interviews represent the missing ones … For most missing answers, an answer from a similar household is copied. The Census Bureau does not know how close the imputed values are to the actual values.</p>
</blockquote>
<p>An interesting exercise is to see the unit non-response rate (data is completely missing) over the past 20 years:</p>
<p><img src="/assets/img/uncertain_econometrics/cps_non_response_by_type.png" alt="CPS non-response by type" /></p>
<p>The Type A non-responses are non-interviews (e.g. no one home, language barrier, <strong>refused</strong>, etc.), while Type B and Type C are due to temporarily or permanently unoccupied housing units. As we can see from the chart above, the Type B and Type C rates stayed the same throughout the years, whereas the Type A rate grew from 4% to 12%! Overall, the non-response rate has grown from 12% to 20% since 2010.</p>
<p>How much does the refusal to participate contribute to Type A non-responses?</p>
<p><img src="/assets/img/uncertain_econometrics/type_a_due_to_refusal.png" alt="Type A due to refusal" /></p>
<p>As we can see, refusal to participate completely dominates the Type A category. Analysis done by Bernhardt et al. (2021) points to blue states and states with a higher rural population as contributors to a lower non-response rate, whilst states with lost manufacturing industries and younger populations (a weak correlation) tend to contribute to a higher refusal rate.</p>
<p>We can clearly observe that the fraction of households refusing to participate in the survey is growing at a roughly exponential pace. One naive way to quantify the uncertainty that this loss of data introduces is to estimate the historic lower and upper bounds of the unemployment rate. Whilst Manski (2013) notes that these bounds are “distressingly wide”, they are maximally credible because no prior assumptions are imposed on them – “wide bounds reflect real uncertainty that cannot be washed away by assumptions lacking credibility”. With that in mind, this is how those bounds of uncertainty would look:</p>
<p><img src="/assets/img/uncertain_econometrics/unemployment_rate_uncertainty.png" alt="Unemployment rate uncertainty" /></p>
<p>The unemployment rate in the above chart is calculated for March of every year. The blue line is the official point estimate of the unemployment rate. The gray line is the unemployment rate without the missing data being extrapolated. The dashed lines are the lower and upper bounds of the estimate distribution: the lower bound is computed by assuming that the Type A non-responses are all employed, the upper bound by assuming that they are all unemployed. The chart should not be interpreted as if the range between 2.5% and 17.5% were some sort of confidence interval for the actual unemployment rate; rather, whatever the distribution is, these are its bounds. Its mass could very well be centered around the currently reported unemployment rate (we just do not know that). What is noteworthy, however, is that whatever the distribution is, the possible error of the quantity being estimated has kept growing since 2010.</p>
<h2 id="cpi">CPI</h2>
<p>The final source of uncertainty in economic data is <em>conceptual uncertainty</em>. This type of uncertainty arises from ambiguous or lacking definitions of what is being measured. A good example of this type of error is the consumer price index (CPI). If we plot the year-over-year inflation of the index, what we get looks almost like two charts glued together:</p>
<p><img src="/assets/img/uncertain_econometrics/cpi.png" alt="CPI" /></p>
<p>The red line in the figure above marks Jan 1983. Prior to that date the CPI included the price of “consuming” homeownership, which was based on house prices, mortgage rates, property taxes, insurance, and maintenance costs, and constituted a total of 26.1% of the index. The index was then modified to instead track something called owner’s equivalent rent (OER), which was measured by surveying real estate owners and asking them the following question:</p>
<blockquote>
<p>If someone were to rent this (including part of the property currently being used for business, farming, or render/home today) how much do you think it would rent for monthly, unfurnished and without utilities? – CEQ (2021)</p>
</blockquote>
<p>At the beginning OER made up only 14.5% of the index; now it is at 24.3%. This change was made because the old CPI was mechanically tied, through mortgage rates, to the overnight interest rates controlled by the Fed. This made fighting inflation in the 70s and 80s like chasing your own tail: every time the Fed raised interest rates the CPI would immediately follow upwards, and it would artificially drop as interest rates dropped.</p>
<p>The housing market related change was not the first nor the last change made to the index. The full list can be found <a href="https://www.bls.gov/cex/capi/2021/2021-CEQ-CAPI-instrument-specifications.pdf">here</a>.</p>
<p>Recently Bolhuis et al. (2022) published a paper that backcasts the consumer price index with OER before 1983, all the way to the 1950s, in order to see how inflation would have looked back then if owner’s equivalent rent had been used all along:</p>
<p><img src="/assets/img/uncertain_econometrics/cpi_oer_corrected.png" alt="CPI OER corrected" /></p>
<p>The corrected version was generated by using rent inflation in place of shelter prior to 1983. There is a problem with this backcast: the Fed officials never saw the “corrected” chart, so in a parallel universe where OER was used from the start, rent inflation might have looked a lot different, decisions would have differed, and so would the future. In any case, I think the adjusted version does succeed in showing the potential inaccuracies hidden in the official chart.</p>
<p>Bolhuis et al. (2022) have also backcasted the CPI using weights from different periods; for instance, here is one using the most recent weights:</p>
<p><img src="/assets/img/uncertain_econometrics/cpi_2022_corrected.png" alt="CPI 2022 weights" /></p>
<p>The paper also mentions some examples of potential pitfalls when comparing current inflation with different historic periods using the official chart:</p>
<blockquote>
<p>…in arguing against policy makers falling “behind the curve” in the face of rising inflation, Blanchard (2022) showed that today’s gap between core inflation – which removes volatile food and energy prices – and real interest rates is approaching about 70 percent of the 1975 gap. We argue against using official CPI inflation to assess this gap… pre-1983 peak CPI inflation measures, especially during the Volcker-era, [were] artificially high at the beginning of the tightening cycle, and declines look artificially fast.
…Recent work suggests that the years following World War II have strong similarities to the current inflation environment (e.g. Rouse et al., 2021; DeLong, 2022). We show that due to the greater weight of transitory goods components – especially food and apparel – in the index of the 1940s and 1950s, past inflation spikes were higher and more short-lived than today’s.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>As I was looking into this, it started to make sense why people do not like to impose confidence intervals on economic data. Given the size of the uncertainty, nobody would be able to make decisions or build narratives around the data. It is almost as if “it does not matter if it is right or wrong, just give me a number to work with”.</p>
<h2 id="references">References</h2>
<ul>
<li>Bernhardt, R., Munro, D., & Wolcott, E. (2021). How Does the Dramatic Rise of CPS Non-Response Impact Labor Market Indicators? (No. 781). GLO Discussion Paper.</li>
<li>Bolhuis, M. A., Cramer, J. N., & Summers, L. H. (2022). Comparing Past and Present Inflation (No. w30116). National Bureau of Economic Research.</li>
<li><a href="https://www.bls.gov/cex/capi/2021/2021-CEQ-CAPI-instrument-specifications.pdf">Consumer Expenditure Surveys Interview Questionnaire</a> (CEQ) (2021), p116</li>
<li>Fixler, D. J., Greenaway-McGrevy, R., & Grimm, B. T. (2011). Revisions to GDP, GDI, and their major components. Survey of Current Business, 91(7), 9-31.</li>
<li>Manski, C. F. (2014). Credible Interval Estimates for Official Statistics with Survey Nonresponse.</li>
<li>Manski, C. F. (2013). Communicating Uncertainty in Official Economic Statistics.</li>
<li>Jordà, Ò. et al. (2020). The Fog of Numbers. <a href="https://www.frbsf.org/economic-research/publications/economic-letter/2020/july/fog-of-numbers-gdp-revisions/">https://www.frbsf.org/economic-research/publications/economic-letter/2020/july/fog-of-numbers-gdp-revisions/</a>.</li>
<li>U.S. Census Bureau (2011), Current Housing Reports, Series H150/09, American Housing Survey for the United States: 2009, Washington, DC: U.S. Government Office.</li>
</ul>
<h1 id="notes-on-simulated-annealing">Notes on Simulated Annealing (2022-01-28)</h1>
<h2 id="metropolis-algorithm-for-simulating-the-evolution-of-a-solid-in-a-heat-bath-to-thermal-equilibrium">Metropolis algorithm for simulating the evolution of a solid in a heat bath to thermal equilibrium</h2>
<p>The algorithm samples a state space by perturbing the current state <em>i</em> and transitioning to the perturbed state <em>j</em> with probability 1 if the energy of the new state is lower, or with probability \(\text{exp}(\frac{E_i - E_j}{k_BT})\) otherwise.</p>
<p>If you sample long enough the probability distribution of visiting every state approaches the <em>Boltzmann distribution</em>:</p>
\[P(X=i) = \frac{e^{\frac{-E_i}{k_BT}}}{\sum_j e^{\frac{-E_j}{k_BT}}}\]
<h2 id="simulated-annealing">Simulated annealing</h2>
<p>Compared to the Metropolis algorithm, simulated annealing uses the cost of a state $f(i)$ as the energy $E_i$, and an acceptance probability without <em>Boltzmann’s constant</em> $k_B$:</p>
\[P_c\{\text{accept j}\} = \left\{\begin{matrix}
1 & \text{ if } f(j) \leq f(i)
\\
e^{\frac{f(i) - f(j)}{c}} & \text{ else}
\end{matrix}\right.
\label{eq:2.4} (2.4)\]
<h3 id="is-it-optimal">Is it optimal?</h3>
<p>In other words, can it find a global optimum?</p>
<p><strong>Conjecture 2.1</strong> Given an instance (S,f) of a combinatorial optimization problem and a suitable neighborhood structure then, after a sufficiently large number of transitions at a fixed value of <code class="language-plaintext highlighter-rouge">c</code>, applying the acceptance probability of (2.4), the simulated annealing algorithm will find a solution $i \in S$ with a probability equal to:</p>
\[P_c\{X = i\} = q_i(c) = \frac{e^{-\frac{f(i)}{c}}}{\sum_{j \in S}e^{-\frac{f(j)}{c}}}
\label{eq:2.5} (2.5)\]
<p><span class="marginnote">Basically if you were to random walk using simulated annealing algorithm at temperature c long enough, the prob of visiting every state will converge to $P_c(X=i)$</span></p>
<p>where $X$ is a stochastic variable denoting the current solution obtained by the simulated annealing.</p>
<p>MATLAB code that checks this using the 4-queens problem:</p>
<div class="language-matlab highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">%% sample using simulated annealing</span>
<span class="n">solution</span> <span class="o">=</span> <span class="nb">randperm</span><span class="p">(</span><span class="mi">4</span><span class="p">);</span>
<span class="n">score</span> <span class="o">=</span> <span class="n">fitness</span><span class="p">(</span><span class="n">solution</span><span class="p">);</span>
<span class="n">N</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">;</span>
<span class="n">H</span> <span class="o">=</span> <span class="p">[</span><span class="n">solution</span><span class="p">];</span>
<span class="nb">count</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">t</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">:</span><span class="n">N</span>
<span class="n">succ</span> <span class="o">=</span> <span class="n">mutate</span><span class="p">(</span><span class="n">solution</span><span class="p">);</span>
<span class="n">fsucc</span> <span class="o">=</span> <span class="n">fitness</span><span class="p">(</span><span class="n">succ</span><span class="p">);</span>
<span class="n">fsol</span> <span class="o">=</span> <span class="n">fitness</span><span class="p">(</span><span class="n">solution</span><span class="p">);</span>
<span class="k">if</span> <span class="n">fsucc</span> <span class="o"><=</span> <span class="n">fsol</span> <span class="o">||</span> <span class="nb">rand</span> <span class="o"><=</span> <span class="nb">exp</span><span class="p">((</span><span class="n">fsol</span> <span class="o">-</span> <span class="n">fsucc</span><span class="p">)/</span><span class="n">t</span><span class="p">)</span>
<span class="n">solution</span> <span class="o">=</span> <span class="n">succ</span><span class="p">;</span>
<span class="k">end</span>
<span class="p">[</span><span class="n">tf</span><span class="p">,</span> <span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="nb">ismember</span><span class="p">(</span><span class="n">solution</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="s1">'rows'</span><span class="p">);</span>
<span class="k">if</span> <span class="n">index</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">H</span> <span class="o">=</span> <span class="p">[</span><span class="n">H</span><span class="p">;</span> <span class="n">solution</span><span class="p">];</span>
<span class="nb">count</span> <span class="o">=</span> <span class="p">[</span><span class="nb">count</span><span class="p">;</span> <span class="mi">1</span><span class="p">];</span>
<span class="k">else</span>
<span class="nb">count</span><span class="p">(</span><span class="n">index</span><span class="p">)</span> <span class="o">=</span> <span class="nb">count</span><span class="p">(</span><span class="n">index</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="c1">%% check if probs match Boltzman's distribution</span>
<span class="n">pbar</span> <span class="o">=</span> <span class="nb">count</span><span class="p">/</span><span class="nb">sum</span><span class="p">(</span><span class="nb">count</span><span class="p">);</span>
<span class="n">f</span> <span class="o">=</span> <span class="nb">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="nb">length</span><span class="p">(</span><span class="n">H</span><span class="p">));</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">:</span><span class="nb">length</span><span class="p">(</span><span class="n">H</span><span class="p">)</span>
<span class="n">f</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">=</span> <span class="n">fitness</span><span class="p">(</span><span class="n">H</span><span class="p">(</span><span class="n">i</span><span class="p">,:));</span>
<span class="k">end</span>
<span class="n">z</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">exp</span><span class="p">(</span><span class="o">-</span><span class="n">f</span><span class="p">/</span><span class="n">t</span><span class="p">));</span>
<span class="n">p</span> <span class="o">=</span> <span class="nb">exp</span><span class="p">(</span><span class="o">-</span><span class="n">f</span><span class="p">/</span><span class="n">t</span><span class="p">)/</span><span class="n">z</span><span class="p">;</span>
<span class="nb">plot</span><span class="p">(</span><span class="n">pbar</span><span class="p">,</span> <span class="s1">'DisplayName'</span><span class="p">,</span><span class="s1">'pbar'</span><span class="p">);</span>
<span class="nb">hold</span> <span class="n">on</span>
<span class="nb">plot</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="s1">'DisplayName'</span><span class="p">,</span> <span class="s1">'p'</span><span class="p">)</span>
<span class="nb">legend</span>
</code></pre></div></div>
<p>Result:</p>
<p><img src="https://marti-1.s3.amazonaws.com/notes/simulated_annealing/sa_boltzmann_dist.png" alt="image" /></p>
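<p>The <code>fitness</code> and <code>mutate</code> helpers are not shown in the script above. Assuming <code>fitness</code> counts attacking queen pairs in a permutation encoding and <code>mutate</code> swaps two random columns, the same check can be sketched in Python:</p>

```python
import math
import random
from collections import Counter

def fitness(perm):
    """Attacking queen pairs; with one queen per row and column,
    only diagonal conflicts are possible."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(perm[i] - perm[j]) == j - i)

def mutate(perm):
    """Return a copy of perm with two random positions swapped."""
    i, j = random.sample(range(len(perm)), 2)
    new = list(perm)
    new[i], new[j] = new[j], new[i]
    return new

random.seed(0)
t, N = 1.0, 100_000
solution = list(range(4))
counts = Counter()
for _ in range(N):
    succ = mutate(solution)
    if fitness(succ) <= fitness(solution) or \
       random.random() <= math.exp((fitness(solution) - fitness(succ)) / t):
        solution = succ
    counts[tuple(solution)] += 1

# Compare empirical visit frequencies with the Boltzmann distribution at t = 1.
# A run this long should cover all 24 permutations of the 4-queens board,
# so z is effectively a sum over the full state space.
states = list(counts)
z = sum(math.exp(-fitness(s) / t) for s in states)
empirical = {s: counts[s] / N for s in states}
boltzmann = {s: math.exp(-fitness(s) / t) / z for s in states}
```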
<p>A set of useful quantities can be derived from 2.5.</p>
<p>The <em>expected cost</em>:</p>
\[\begin{aligned}
E_c(f) &\triangleq ~ <f>_c \\
&= \sum_{i \in S}f(i)P_c\{X=i\} \\
&= \sum_{i \in S}f(i)q_i(c)
\end{aligned}\]
<p>The <em>expected squared cost</em>:</p>
\[\begin{aligned}
E_c(f^2) &\triangleq ~ <f^2>_c \\
&= \sum_{i \in S}f^2(i)P_c\{X=i\} \\
&= \sum_{i \in S}f^2(i)q_i(c)
\end{aligned}\]
<p><em>Variance</em>:</p>
\[\begin{aligned}
\text{Var}_c(f) &\triangleq ~ \sigma^2_c \\
&= <f^2>_c - <f>^2_c
\end{aligned}\]
<p><em>Entropy</em>:</p>
\[S_c = -\sum_{i \in S}q_i(c)\text{ln}(q_i(c))\]
<p>In the case of simulated annealing, or optimization in general, the entropy can be interpreted as a quantitative measure for the degree of optimality.
<span class="marginnote">The more certainty there is for the system to stay in a certain state, the smaller the entropy, and the higher the optimality.</span></p>
<p><strong>Corollary 2.1</strong> Given an instance (S,f) of a combinatorial optimization problem and a suitable neighborhood structure. Furthermore, let the stationary distribution be given by (2.5), then:</p>
\[\text{lim}_{c\rightarrow 0}q_i(c) = q_i^{*} = \frac{1}{|S_{opt}|}\chi (S_{opt})(i)\]
<p><span class="marginnote">Basically the 2.5 collapses into a uniform dist of optimal solutions – system can only be at one of those optimal states</span></p>
<p>where \(S_{opt}\) denotes the set of globally optimal solutions. Let \(A\) and \(A' \subset A\) be two sets. Then the characteristic function \(\chi(A'):A \rightarrow \{0, 1\}\) of the set \(A'\) is defined as \(\chi(A')(a) = 1\) if \(a \in A'\) and \(\chi(A')(a) = 0\) otherwise.</p>
<p><em>Proof</em>: Using the fact that, for \(a \leq 0\), \(\text{lim}_{x \rightarrow 0^+} e^{a/x}\) equals 1 if \(a = 0\) (\(e^{0/x}=e^0=1\)) and 0 otherwise:</p>
<p><img src="https://marti-1.s3.amazonaws.com/notes/simulated_annealing/exp_minus_1_div_x.png" alt="image" /></p>
<p>and the following facts:</p>
<p><strong>(1)</strong></p>
\[e^{\frac{(f_{opt}-f_{opt})}{c}} = 1\]
<p><strong>(2)</strong></p>
\[\begin{aligned}
\text{lim}_{c \rightarrow 0} \sum_{j \in S}e^{\frac{(f_{opt}-f(j))}{c}} &= \text{lim}_{c \rightarrow 0} (\sum_{j \in S_{opt}}e^{\frac{(f_{opt}-f_{opt})}{c}} + \sum_{j \in S \backslash S_{opt}}e^{\frac{(f_{opt}-f(j))}{c}}) \\
&= \text{lim}_{c \rightarrow 0} (\sum_{j \in S_{opt}}e^{\frac{(f_{opt}-f_{opt})}{c}}) + \text{lim}_{c \rightarrow 0} ( \sum_{j \in S \backslash S_{opt}}e^{\frac{(f_{opt}-f(j))}{c}}) \\
&= (1 + 1 + \dots + 1_{|S_{opt}|}) + (0 + 0 + \dots + 0_{|S \backslash S_{opt}|}) \\
&= |S_{opt}| + 0
\end{aligned}\]
<p>we obtain:</p>
<p><span class="marginnote">\(f_{opt} = 0\)</span></p>
\[\begin{aligned}
\text{lim}_{c \rightarrow 0} q_i(c) &= \text{lim}_{c \rightarrow 0} \frac{e^{-f(i)/c}}{\sum_{j \in S}e^{-f(j)/c}} \\
&= \text{lim}_{c \rightarrow 0} \frac{e^{(f_{opt}-f(i))/c}}{\sum_{j \in S}e^{(f_{opt}-f(j))/c}} \\
\end{aligned}\]
<p><span class="marginnote">\(\chi(S)(i) \rightarrow 1\) if \(i \in S\), 0 otherwise</span></p>
\[\begin{aligned}
&= \text{lim}_{c \rightarrow 0} \frac{e^{\frac{(f_{opt}-f_{opt})}{c}}}{\sum_{j \in S}e^{\frac{(f_{opt}-f(j))}{c}}}\chi (S_{opt})(i) \\
&+\text{lim}_{c \rightarrow 0} \frac{e^{\frac{(f_{opt}-f(i))}{c}}}{\sum_{j \in S}e^{\frac{(f_{opt}-f(j))}{c}}} \chi (S \backslash S_{opt})(i) \\
&= \frac{1}{|S_{opt}|} \chi (S_{opt})(i) + 0
\end{aligned}\]
<p><span class="marginnote">\(S \backslash S_{opt}\) – all non-optimal states</span></p>
<p>Thus given conjecture 2.1 and corollary 2.1, we can say that simulated annealing reaches a uniform distribution of optimal solutions as temperature \(c\) goes to 0 and number of steps taken at every temperature point approaches infinity:</p>
\[\text{lim}_{c \rightarrow 0} \text{lim}_{k \rightarrow \infty} P_c\{X(k) = i\} = \text{lim}_{c \rightarrow 0} q_i(c) = q^*_i\]
<p>Thus simulated annealing finds an optimal solution if an infinite number of transitions is allowed. Practically, that is not possible.</p>
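<p>Corollary 2.1 is easy to check numerically on a toy cost vector: as \(c \rightarrow 0\), the stationary distribution (2.5) collapses onto the optimal states, and its entropy approaches \(\text{ln}|S_{opt}|\). A minimal Python sketch with a made-up cost vector:</p>

```python
import math

def stationary(costs, c):
    """Stationary distribution (2.5) over states with the given costs at temperature c."""
    f_opt = min(costs)
    # Factor exp(f_opt/c) out of numerator and denominator for numerical
    # stability -- the same trick used in the proof of Corollary 2.1.
    w = [math.exp((f_opt - f) / c) for f in costs]
    z = sum(w)
    return [wi / z for wi in w]

costs = [0.0, 0.0, 1.0, 2.0, 5.0]  # two globally optimal states
for c in (10.0, 1.0, 0.1, 0.01):
    q = stationary(costs, c)
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
```

<p>At \(c = 10\) the distribution is close to uniform over all five states; at \(c = 0.01\) it is, to numerical precision, uniform over the two optimal ones, with entropy \(\text{ln}(2)\).</p>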
<h3 id="can-we-approximate-arbitrarily-closely">Can we approximate arbitrarily closely?</h3>
<p>In the previous section it was shown that if one were to generate an infinite homogeneous Markov chain, it would eventually converge to a stationary distribution. From that point we could decrease the temperature and repeat the process, which would guarantee eventually reaching a uniform stationary distribution over only the optimal solutions. However, in practice generating an infinite number of transitions at every temperature is impossible.</p>
<p>Instead of an infinite number of transitions, we could generate a finite sequence of transitions at every temperature, combine those sequences into one inhomogeneous sequence, and see whether it could approximate $q^*$ arbitrarily closely.</p>
<p><strong>Definition 3.11</strong> Let \(c_{l}^{'}\) denote the value of the control parameter of the $l^{th}$ homogenous Markov chain, \(L\) denote the length of the homogeneous Markov chains, and \(c_k\) denote the value of the control parameter at the $k^{th}$ trial. Then we define the sequence \(\{c_k\}\), \(k=1,\dots\) as follows:</p>
\[c_k = c_l^{'}, ~ lL < k \leq (l+1)L\]
<p><em>NOTE: I am not sure about the “k” parameter, from the above it seems like it would be in the range of <code class="language-plaintext highlighter-rouge">(lL, (l+1)L]</code>. In the book it further mentions that “[it] is taken to be piecewise constant”. So it doesn’t change when sampling at a certain temperature?</em></p>
<p>Our goal is to generate an inhomogeneous Markov chain that approximates \(q^*\) to an arbitrary degree:</p>
\[\left \| a(k) - q^* \right \| < \epsilon\]
<p>where \(a_i(k) = P(X(k) = i)\) (the probability distribution of outcomes at the \(k\)-th trial).</p>
<p>According to <strong>Theorem 3.6</strong>, if the temperature sequence \(\{ c_l^{'} \}\) satisfies the inequality below, then the Markov chain converges to \(q^{*}\):</p>
\[c_l^{'} \geq \frac{(L+1)\delta}{log(l+2)}, ~ l = 0, 1,\dots,\]
<p>where
\(\delta = \text{max}_{i,j \in S} \{ f(j) - f(i) | j \in S_i \}\) (maximum cost difference over all neighborhoods) and $L$ is chosen as the maximum of the minimum number of transitions required to reach an \(i_{opt}\) from \(j\), for all \(j \in S\).</p>
<p><em>NOTE: How is the \(\delta\) computed? Do you have to find a maximum diff of all the states or just the immediate neighborhood of current state?</em></p>
<p><span class="marginnote">L – minimum amount of permutations in order to get to the goal state from furthest state</span></p>
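<p>The temperature sequence from Theorem 3.6 is straightforward to tabulate. A sketch, with \(L\) and \(\delta\) as hypothetical problem-specific constants:</p>

```python
import math

def log_schedule(L, delta, num_chains):
    """Lower bounds on c'_l from Theorem 3.6: c'_l >= (L+1)*delta / log(l+2)."""
    return [(L + 1) * delta / math.log(l + 2) for l in range(num_chains)]

# Hypothetical values: chain length L = 6, max neighbourhood cost gap delta = 2.
cs = log_schedule(6, 2.0, 10)
```

<p>The striking property is how slowly this schedule cools: the temperature decays only logarithmically in the chain index, so halving it requires roughly squaring the number of chains.</p>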
<p>We can estimate the complexity of the transitions \(k\) required for a problem based on the following equation:</p>
\[k = O((\frac{1}{\epsilon})^{\frac{1}{\text{min(a,b)}}})\]
<p>where:</p>
\[a = \frac{1}{(L+1)\Theta^{L+1}} ~, \text{and} ~ b = \frac{\hat{f} - f_{opt}}{(L+1)\delta}\]
<p>with \(\hat{f} = \text{min}_{i \in S \backslash S_{opt}} f(i)\) (minimum cost of all non-optimal solutions)</p>
<h3 id="example-tsp-problem">Example: TSP problem</h3>
<p>Let’s assume we have a 2-change neighborhood structure. Then we have \(L = n - 2\) (the minimum number of 2-changes needed to transform any route into the optimal route), where $n$ is the number of cities, and $\Theta = (n-1)(n-2)$ (the size of the neighborhood, $|S_i|$).</p>
<p>For \(a\) we have:</p>
\[\begin{aligned}
a &= \frac{1}{(L+1)\Theta^{L+1}} \\
&= \frac{1}{(n-2+1)((n-1)(n-2))^{n-2+1}} \\
&= \frac{1}{(n-1)}\left ( \frac{1}{(n-1)(n-2)} \right )^{n-1}
\end{aligned}\]
<p>and \(b\):</p>
<p><span class="marginnote">$\frac{\hat{f} - f_{opt}}{\text{max}_{i,j \in S} { f(j) - f(i) | j \in S_i }} < 0$</span></p>
\[\begin{aligned}
b &= \frac{\hat{f} - f_{opt}}{(L+1)\delta} \\
&= \frac{\hat{f} - f_{opt}}{(n-1)\delta} < \frac{1}{n-1}
\end{aligned}\]
<p>In the book it is stated that $a \ll b$; the only way this is true is if:</p>
\[\left ( \frac{1}{(n-1)(n-2)} \right )^{n-1} \ll \frac{\hat{f} - f_{opt}}{\text{max}_{i,j \in S} \{ f(j) - f(i) | j \in S_i \}}\]
<p><em>NOTE: don’t have an intuition of why this could be the case. The easiest thing would be to play around this empirically.</em></p>
<p>By choosing $\frac{1}{\epsilon} = n$ (the $\epsilon$ is chosen this way so that the transition-count complexities can be compared as functions of $n$, but it could be any other value) we obtain:</p>
\[k = O(n^{n^{2n-1}})\]
<p>whereas \(\|S\| = O((n-1)!)\):</p>
<p><img src="https://marti-1.s3.amazonaws.com/notes/simulated_annealing/k_vs_total_states.png" alt="image" /></p>
<p>The plot for \(k\) terminates at n=2, because afterwards it just shoots to infinity. Thus it is far more efficient to simply enumerate all of the states than to solve the problem using simulated annealing.</p>
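<p>The comparison can be reproduced in log space to avoid overflow, using \(k = O(n^{n^{2n-1}})\) and \(|S| = O((n-1)!)\):</p>

```python
import math

def log10_k(n):
    """log10 of n^(n^(2n-1)), the transition-count estimate for TSP."""
    return n ** (2 * n - 1) * math.log10(n)

def log10_states(n):
    """log10 of (n-1)!, the size of the TSP state space."""
    return math.lgamma(n) / math.log(10)  # lgamma(n) = ln((n-1)!)

ks = [log10_k(n) for n in range(3, 7)]
ss = [log10_states(n) for n in range(3, 7)]
```

<p>Already at \(n = 3\) the bound on \(k\) is a number with about 116 decimal digits, while the state space has only 2 elements, which is the plot’s point in numbers.</p>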
<h3 id="example-n-queens">Example: n-queens</h3>
<p>TBA</p>
<h2 id="finate-time-approximation">Finite-Time Approximation</h2>
<p>Trying to approximate the asymptotic convergence of simulated annealing towards a distribution of optimal solutions might require a number of transitions bigger than the state space itself, resulting in exponential time complexity for most problems. An alternative is to relax optimality and instead aim for solutions in between the local and global optima of the state space. Instead of trying to achieve the stationary distribution at every temperature value, we could aim for some quasi-stationary distribution:</p>
\[\left \| \mathbf{a}(L_k,c_k) - \mathbf{q}(c_k) \right \| < \epsilon ~ (4.1)\]
<p>where \(L_k\) denotes the length of the \(k^{th}\) Markov chain. Then <em>quasi equilibrium</em> is achieved if \(\mathbf{a}(L_k, c_k)\) is “sufficiently close” to $\mathbf{q}(c_k)$, the stationary distribution at $c_k$, for some specified positive value of $\epsilon$.</p>
<p>The quasi equilibrium requires careful design of a cooling schedule. The idea behind the cooling schedules is to start with some $c_0$ that would be guaranteed to generate a stationary distribution (e.g. uniform over all states) and then decrease $c_k$ with small decrements and to generate a fixed length Markov chain that would restore the quasi equilibrium at the end. Finally, a <em>final value</em> of $c_k$ has to be specified in order for the search to terminate. To summarize, the cooling schedule consists of:</p>
<ul>
<li>an initial value – \(c_0\);</li>
<li>a decrement function of \(c_k\);</li>
<li>a final value of \(c_k\);</li>
<li>\(L_k\) – a finite number of transitions at each value of the temperature parameter.</li>
</ul>
<p>An example cooling schedule proposed by Kirkpatrick, Gelatt & Vecchi (<a href="http://www2.stat.duke.edu/~scs/Courses/Stat376/Papers/TemperAnneal/KirkpatrickAnnealScience1983.pdf">1983?</a>):</p>
<h3 id="initial-value-c_0">Initial value $c_0$</h3>
<p>You start with a small \(c_0\) and keep increasing it by a constant factor > 1 until the acceptance ratio computed from the samples is close to 1. The <em>acceptance ratio</em> is defined as:</p>
\[\chi(c) = \frac{\text{number of accepted transitions}}{\text{number of proposed transitions}}\]
<h3 id="decrement-function">Decrement function</h3>
<p>You can have either a small \(\triangle c\) or a large \(L_k\). In practice, a small \(\triangle c\) is favoured. A frequently used decrement function is given by:</p>
\[c_{k+1} = \alpha c_k, ~ k = 1,2,\dots\]
<p>where \(\alpha\) is a constant smaller than but close to 1. Typically between 0.8 and 0.99.</p>
<h3 id="final-value">Final value</h3>
<p>Execution of the algorithm is terminated if the value of the cost function of the solution obtained in the last trial of a Markov chain remains unchanged for a number of consecutive chains.</p>
<h3 id="length-of-markov-chain">Length of Markov chain</h3>
<p>The number of transitions needed to achieve a quasi equilibrium comes from an intuitive argument that quasi equilibrium will be restored after acceptance of at least some fixed number of transitions. However, since transitions are accepted with decreasing probability, one could obtain \(L_k \rightarrow \infty\) for \(c_k \rightarrow 0\). Thus it needs to be bounded by some constant \(\bar{L}\).</p>
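<p>Put together, the four components can be sketched as a generic annealing driver. This is a loose Python rendition, not the book’s exact procedure: the initial-temperature search, acceptance threshold and termination constants are illustrative choices, and <code>fitness</code>/<code>mutate</code> are problem-supplied helpers:</p>

```python
import math
import random

def anneal(init, fitness, mutate, alpha=0.9, chain_len=100, stop_chains=5):
    """Simulated annealing with a Kirkpatrick-style cooling schedule."""
    random.seed(1)

    def run_chain(sol, c):
        accepted = 0
        for _ in range(chain_len):
            succ = mutate(sol)
            if fitness(succ) <= fitness(sol) or \
               random.random() <= math.exp((fitness(sol) - fitness(succ)) / c):
                sol, accepted = succ, accepted + 1
        return sol, accepted

    # Initial value: raise c until nearly every proposal is accepted.
    c, sol = 0.1, init
    while True:
        sol, accepted = run_chain(sol, c)
        if accepted / chain_len > 0.95:
            break
        c *= 1.5

    # Geometric decrements; terminate once the last-trial cost is
    # unchanged for `stop_chains` consecutive chains.
    unchanged, last = 0, fitness(sol)
    while unchanged < stop_chains:
        c *= alpha
        sol, _ = run_chain(sol, c)
        f = fitness(sol)
        unchanged = unchanged + 1 if f == last else 0
        last = f
    return sol

# Example: 6-queens in a permutation encoding.
def fitness(p):
    return sum(1 for i in range(len(p)) for j in range(i + 1, len(p))
               if abs(p[i] - p[j]) == j - i)

def mutate(p):
    i, j = random.sample(range(len(p)), 2)
    q = list(p)
    q[i], q[j] = q[j], q[i]
    return q

best = anneal(list(range(6)), fitness, mutate)
```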
<h2 id="resources">Resources</h2>
<ul>
<li>Simulated annealing and Boltzmann machines: a stochastic approach to combinatorial optimization and neural computing</li>
</ul>
<hr />
<p><strong>Update 2022-02-14</strong>: <a href="https://marti-1.s3.amazonaws.com/notes/simulated_annealing/simulated_annealing.tar.gz">MATLAB code</a> for 8-queens problem using Kirkpatrick, Gelatt & Vecchi cooling schedule.</p>
<hr />
<p><strong>Update 2022-02-17</strong>: A Polynomial-Time Cooling Schedule</p>
<p><strong>Postulate 2.1</strong>. Let \(R_1\) and \(R_2\) be two regions of the value of the cost function, where \(R_1\) denotes the region of a few standard deviations $\sigma_{\infty}$ around $\left \langle f \right \rangle_{\infty}$ and $R_2$ the region close to $f_{min}$. Then, for a typical combinatorial optimization problem, \(\omega(f)\) is given by a normal distribution $\omega_{N}(f)$ in the region $R_1$ and by an exponential distribution \(\omega_{\epsilon}(f)\) in the region \(R_2\). Furthermore, we conjecture that the number of solutions in $R_1$ is much larger than the number of solutions in \(R_2\).</p>
<p>Here:</p>
<ul>
<li>\(\omega(f)\) is a density function of fitness values;</li>
<li>
\[\text{lim}_{c \rightarrow \infty} \left \langle f \right \rangle_c \triangleq \left \langle f \right \rangle_{\infty} = \frac{1}{|S|}\sum_{i \in S} f(i)\]
</li>
<li>
\[\text{lim}_{c \rightarrow \infty} \sigma^2_c \triangleq \sigma^2_{\infty} = \frac{1}{|S|}\sum_{i \in S} (f(i) - \left \langle f \right \rangle_{\infty})^2\]
</li>
</ul>
<p>The figure below depicts how the transition from \(R_1\) to \(R_2\) happens as fitness approaches \(f_{min}\). The blue bars show the density of fitnesses of states explored at \(c_k\); the red line depicts a normal distribution with \(\sigma = \sigma_{\infty}\) and \(\mu = \left \langle f \right \rangle_{\infty}\); the gold line depicts an exponential function defined as \(e^{-f\gamma}\), where \(f\) ranges over all possible fitness values and \(\gamma\) is computed as \(\frac{1}{2c_k}\) in order to satisfy the \(0 < \gamma < c^{-1}\) constraint (<em>NOTE</em>: this comes straight from the book; I am not sure why \(\gamma\) should be in this range, but for depiction purposes it does not matter much, since all we care about is showing that an exponential distribution is approached the closer we get to the optimal solution).</p>
<p><img src="https://marti-1.s3.amazonaws.com/notes/simulated_annealing/postulate2_1.gif" alt="image" /></p>
<h3 id="initial-value-c_0-1">Initial value \(c_0\)</h3>
<p>The acceptance ratio is defined as:</p>
\[\chi(c) = \frac{\text{accepted}}{\text{proposed}}\]
<p>All proposals that satisfy \(f(j) \leq f(i)\) are accepted; let’s call their number \(m_1\). The rest (\(f(j) > f(i)\)), \(m_2\) of them, are accepted with probability $e^{-\frac{\Delta \bar{f}^+}{c}}$, where $\Delta \bar{f}^+$ is the average cost difference of proposals costlier than the current solution. Thus we can approximate the acceptance ratio by:</p>
\[\chi(c) = \frac{m_1 + m_2 e^{-\frac{\Delta \bar{f}^+}{c}}}{m_1 + m_2}\]
<p>from the above we can express $c$ as:</p>
\[c = \frac{\Delta \bar{f}^+}{\text{ln}(\frac{m_2}{m_2 \chi_0 - m_1(1-\chi_0)})}\]
<p>where $\chi_0$ is a hyperparameter: the preferred acceptance ratio.</p>
<p>The $c_0$ is calculated in the following way:</p>
<ol>
<li>$c_0 = 0$</li>
<li>generate a sequence of transitions</li>
<li>compute new $c_0$ from the equation above</li>
<li>if not converged go to step 2</li>
</ol>
<p>Apparently you can converge fast to the final $c_0$ value this way.</p>
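<p>The update step can be written directly from the formula above. A sketch with made-up cost differences observed during an initial random walk:</p>

```python
import math

def next_c0(deltas, chi0=0.95):
    """One c0 update from a batch of proposed cost differences f(j) - f(i).

    m1 counts cheaper-or-equal proposals (always accepted), m2 the costlier
    ones; dbar is the mean positive cost difference.
    """
    m1 = sum(1 for d in deltas if d <= 0)
    positives = [d for d in deltas if d > 0]
    m2 = len(positives)
    dbar = sum(positives) / m2
    return dbar / math.log(m2 / (m2 * chi0 - m1 * (1 - chi0)))

# Hypothetical cost differences from one batch of proposals.
c0 = next_c0([-2.0, 1.0, 3.0, -0.5, 2.0, 4.0])
```

<p>Note that the formula only makes sense while \(m_2 \chi_0 > m_1 (1 - \chi_0)\), i.e. while the desired acceptance ratio is actually reachable for the batch at hand.</p>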
<h3 id="decrement-function-1">Decrement function</h3>
<p>First we make an assumption that if $q(c_{k+1})$ does not deviate far from $q(c_k)$, then quasi equilibrium is maintained, assuming it held at $c_0$. This can be quantified as:</p>
\[\forall i \in S: \frac{1}{1+\delta} < \frac{q_i(c_k)}{q_i(c_{k+1})} < 1 + \delta\]
<p><em>Theorem</em>: The above inequality is satisfied if the following condition holds:</p>
\[\forall i \in S: \frac{exp(-\frac{\delta_i}{c_k})}{exp(-\frac{\delta_i}{c_{k+1}})} < 1+\delta\]
<p>where $\delta_i = f(i) - f_{opt}$</p>
<p>The above equation can be rewritten in order to extract required inequality between $c_k$ and $c_{k+1}$:</p>
\[\forall i \in S: c_{k+1} > \frac{c_k}{1+\frac{c_k ln(1+\delta)}{f(i) - f_{opt}}}\]
<p>Using Postulate 2.1, we can simplify the above expression by considering a smaller set of states:</p>
\[S_{c_k} = \{i \in S | f(i) - f_{opt} \leq \left \langle f \right \rangle_{c_k} - f_{opt} + 3\sigma_{c_k} \}\]
<p>The $S_{c_k}$ set includes states with costs up to 3 standard deviations away from the average. Considering that during cooling the cost density evolves from a normal to an exponential distribution, this captures roughly 99% and 95% of the states respectively.</p>
<p>The term $\left \langle f \right \rangle_{c_k} - f_{opt} + 3\sigma_{c_k}$ in $f(i) - f_{opt} \leq \left \langle f \right \rangle_{c_k} - f_{opt} + 3\sigma_{c_k}$ acts as an upper bound on all $f(i) - f_{opt}$, thus if the latter is replaced by the former we get a stronger inequality condition:</p>
<p><span class="marginnote">The condition is stronger because the lower bound on $c_{k+1}$ is larger, i.e. the temperature is decreased more conservatively</span></p>
\[c_{k+1} > \frac{c_k}{1+\frac{c_k ln(1+\delta)}{\left \langle f \right \rangle_{c_k} - f_{opt} + 3\sigma_{c_k}}} > \frac{c_k}{1+\frac{c_k ln(1+\delta)}{f(i) - f_{opt}}}\]
<p>Getting $f_{opt}$ for most combinatorial problems is not possible. However, since $\left \langle f \right \rangle_{c_k}$ and $3\sigma_{c_k}$ co-vary, the denominator is approximately a scaled version of one of those variables; thus we can omit $\left \langle f \right \rangle_{c_k} - f_{opt}$ and rescale the fraction by using smaller $\delta$ values:</p>
\[c_{k+1} = \frac{c_k}{1+\frac{c_k ln(1+\delta)}{3\sigma_{c_k}}}\]
<p><em>NOTE:</em> How is $\delta$ selected? Seems like something that needs to be eye-balled a priori.</p>
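<p>The decrement itself is a one-liner; here is a sketch with \(\delta\) and \(\sigma_{c_k}\) treated as given (in practice \(\sigma_{c_k}\) would be estimated from the costs sampled during chain \(k\)):</p>

```python
import math

def next_temperature(c_k, sigma_ck, delta=0.1):
    """Decrement rule above: c_{k+1} = c_k / (1 + c_k*ln(1+delta)/(3*sigma_ck))."""
    return c_k / (1 + c_k * math.log(1 + delta) / (3 * sigma_ck))

c_next = next_temperature(10.0, sigma_ck=2.0)
```

<p>Smaller \(\delta\) values give slower cooling; likewise, a large \(\sigma_{c_k}\) (typical of the hot phase, where sampled costs vary a lot) keeps each decrement small relative to \(c_k\).</p>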
<h3 id="stopping-condition">Stopping condition</h3>
<p>The algorithm should terminate when $\Delta \left \langle f \right \rangle_{c_k}$ (the gradient of the average cost at temperature $c_k$) is “sufficiently” small with respect to $\left \langle f \right \rangle_{c_0}$:</p>
\[\frac{\Delta \left \langle f \right \rangle_{c_k}}{\left \langle f \right \rangle_{c_0}} < \epsilon_s\]
<p>For sufficiently large values of $c_0$, we have $\left \langle f \right \rangle_{c_0} \approx \left \langle f \right \rangle_{\infty}$. Also, for $c_k \ll 1$ (which is always true towards the end):</p>
\[\Delta \left \langle f \right \rangle_{c_k} \approx c_k \frac{\partial \left \langle f \right \rangle_{c_k}}{\partial c_k}\]
<p>which gives:</p>
\[\frac{c_k \partial \left \langle f \right \rangle_{c_k}}{\left \langle f \right \rangle_{c_0} \partial c_k} < \epsilon_s\]
<p>In practice $\left \langle f \right \rangle_{c_k}$ fluctuates, so it needs to be smoothed in order to avoid premature termination.</p>
<h3 id="length-of-markov-chain-1">Length of Markov chain</h3>
<p>We want to select $L_k$ large enough to guarantee that most/all of the neighbours get visited even in the case where none of the proposed states is accepted. This makes sense because, if after iteration $k$ we have not moved to some neighbour, we want it to be because no neighbour had a lower cost, not because it was never proposed.</p>
<p><span class="marginnote">
<a href="/notes/math/statistics/2022/02/19/sample.html#prob-of-selecting-an-element-from-a-set-s-in-n-samplings">Prob of selecting an element from a set S in N samplings</a>
</span></p>
<p>For large $|S_i|$ (the cardinality of the neighbourhood of state $i$) and a large number of samplings $N$, with $|S_i| = N$, the probability of visiting any given neighbour of $i$ can be approximated as $1 - e^{-\frac{N}{|S_i|}} = 1 - e^{-1} \approx 2/3$. The approximation assumes that no solutions were accepted and therefore a total of $N$ samples were made.</p>
<p>If we take $N = 3|S_i|$, then $1 - e^{-3} \approx 0.95$, i.e. close to $1$.</p>
<p>In the literature, $L_k$ ($L_k = N$) is advised to range from $|S_i|$ to $3|S_i|$.</p>
<p>This approximation seems to hold for $|S_i| > 100$.</p>
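The $1 - e^{-N/|S_i|}$ approximation is easy to check empirically with a small Monte Carlo experiment (a sketch; the neighbourhood is modelled as plain uniform sampling with replacement, and the function name is mine):

```python
import math, random

def visit_probability(set_size, n_samples, trials=2000, seed=0):
    """Empirical probability that a fixed element of a set of size
    `set_size` is drawn at least once in `n_samples` uniform draws
    with replacement."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.randrange(set_size) == 0 for _ in range(n_samples))
        for _ in range(trials)
    )
    return hits / trials

# With N = |S| the theory predicts 1 - e^{-1} ≈ 0.632,
# and with N = 3|S| it predicts 1 - e^{-3} ≈ 0.950.
```

For `set_size` above a few hundred, the empirical frequency lands within sampling noise of the predicted values.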
<h3 id="proof">Proof</h3>
<p>What is the probability of selecting an element from a set $S$ in $N$ samplings?</p>
<p>P of not selecting an element in a single sampling:
\(1 - \frac{1}{|S|}\)</p>
<p>P of not selecting in $N$ samplings:
\(\left(1 - \frac{1}{|S|}\right)^N = \left(\frac{|S|-1}{|S|}\right)^N\)</p>
<p>ln(P of not selecting in $N$ samplings):
\(\ln\left(\left(1 - \frac{1}{|S|}\right)^N\right) = N\ln\left(1 - \frac{1}{|S|}\right)\)</p>
<p><strong>Log approximation</strong></p>
\[\ln(1 + x) \approx x ~~ \text{for small}~ x\]
<p>thus, if $|S|$ is large
\(N\ln\left(1-\frac{1}{|S|}\right) \approx N\left(-\frac{1}{|S|}\right) = -\frac{N}{|S|}\)</p>
<p>P of not selecting in $N$ samplings:
\(\text{e}^{-\frac{N}{|S|}}\)</p>
<p>P of selecting in N =
\(1 - \text{e}^{-\frac{N}{|S|}}\)</p>
<h2 id="metaverse-might-just-solve-climate-change">Metaverse Might Just Solve Climate Change (2021-12-09)</h2>
<p><em>NOTE. The <strong>metaverse</strong> in this article is meant as a term used to describe a social 3D virtual world/economy in general and not Facebook’s recent rebranding.</em></p>
<p>For the longest time I was somewhat pessimistic about how realistic the climate change fight is. First of all, there are a lot of prematurely optimistic headlines floated by the media about how solid progress is being made. However, CO2 capture, renewable energy, pledges made during climate change conferences and so on might all look hopeful summarized in a title, but once you look under the hood, you quickly realize how much of a work in progress they all are. Second, it is human nature not to care until the last minute, consuming and being hypocritical. Everyone is pointing fingers at the oil & gas industry and calling them the villain, because somebody needs to take the blame, and it is surely not Us who are responsible, right? Nobody wants to use public transport, we are constantly gorging on cheap fashion and useless crap on Amazon, constantly just staring at our phones. We pretend to be super concerned about the environment but also need to “travel the world”, eat meat and so on. Finally, and most cynical of all, the aforementioned consumerism is not a bug in our system, it is actually a feature – this is how the system is designed to operate. If we were to stop consuming, GDP would drop and we would have a recession that would eventually turn into a depression; society would fragment, order would fall apart and we would end up with a bloodbath. In short, we find ourselves in a deadlock – if we continue to consume we are screwed, if we stop consuming we are also screwed, just faster and with a higher guarantee.</p>
<p>Seems like a dead end? Maybe not. Lockdowns have shown that a <a href="https://www.nature.com/articles/d41586-021-00090-3">-6.4%</a> change in CO2 emissions compared to 2019 can actually be achieved at any time just by staying at home. We have, for the first time, come very close to the <a href="https://www.nature.com/articles/d41586-021-00090-3">-7.6%</a> emissions target set by the 2015 Paris climate agreement. The drop in emissions was mostly driven by reductions in transportation: aviation decreased by 75%, surface transport by 50%. Transportation is among the biggest contributors to our GHG emissions. It is also the most redundant one going forward in a society connected by the internet.</p>
<p>The 2020–21 Covid-19 experience might be both a hint at a probable solution and a peek into where we are headed. Whilst a never-ending-lockdown future is dystopian, it might not have to be forced, but rather naturally transitioned into. Lockdowns have shown us that working, studying and socializing from home is viable even with such rudimentary technologies as Microsoft Teams, Zoom or Twitch.</p>
<p>Improvements in VR and the development of the metaverse might evolve virtual life from a crude substitute into a strong alternative or even the main attraction. Most importantly, the creation of the metaverse would solve the consumption part of the equation beyond transportation. The fashion industry is estimated to contribute <a href="https://www.nature.com/articles/s41558-017-0058-9">5</a>-<a href="https://quantis-intl.com/report/measuring-fashion-report/">7%</a> of total CO2 emissions – that could just become purely virtual. There could be a further reduction in demand for iron, steel and cement (another <a href="https://ourworldindata.org/emissions-by-sector">-10%</a>) if physical housing became less of an asset than it is today, especially in China, where ghost towns are being built purely for economic growth and investment purposes, not for living. Obviously, none of the mentioned CO2 reductions would happen overnight, nor would they reach an absolute zero. A certain amount of housing would still be needed, as would clothing, an occasional haircut and so on.</p>
<p>Another interesting upside to this approach is that countries like China and India would be forced to join this growing new economy or risk being left behind. Again, no need to ask for pointless pledges and promises – if the West stops consuming cheap physical goods and transitions into cheap virtual goods consumption, producer economies will be forced to adjust or enter recession.</p>
<p>The only question is how far the VR tech can get us and how fast to keep us happy at home?</p>