Correlation of Data Series. A Scientific Study on the Selection of Meaningful Variables and Functions for the Separation of Trends, Cyclic Parts and Scatter from Data Series
Scientific Study, 2015, 79 pages
Table of Contents
On the Correlation of Data Series
1 Selection of Meaningful Functions and Variables for the Analysis of Data Series
2 Examples for Separating Functions
2.1 Growth Functions
2.2 Other Methods like so-called Filters
2.3 Useful functions for the steady part of the data series
2.4 Cyclic Components
2.5 Elimination of Scattered Points (Noise)
2.6 Overview on Separating Functions
2.6.1 Independent and additive trends and fluctuations
2.6.2 Dependent and multiplicative growth functions
2.6.3 Importance of Predictability
3 Examples for the proposed Method
3.1 Example One: World population time series
3.2 Example Two: Dow Jones Industrial Index
3.3 Example Three: World Climate Change
3.3.1 Description of Climate Separating Functions
3.3.2 Comparison of three different CRU time series
3.3.3 Attention
4 Conclusion
5 Literature and Source of Examples
6 Some Abbreviations and Nomenclature used
7 List of Figures
Originally, I wrote this essay for readers interested in the correlation of data series and supplemented it with examples that explain the proposed methods. It contains several mathematical models and numerous figures, graphics and charts for better comprehension.
During the preview of this essay, the selected examples turned out to be so exciting that a short introduction to the examples themselves seemed advisable, also for readers not interested in mathematics and modeling:
The increasing number of annual airport passengers was a good example of the fundamental question of whether time or population causes this increase.
For the rise of the atmospheric carbon dioxide content over long periods within the present interglacial, I found an unrivaled variable for the correlation.
For the growth of the world population from the origin of Homo sapiens up to its growth limit, I compared two different models, one time-based, the other population-based.
For the long-term growth of the Dow Jones Industrial Index, I found a good correlation with world population and time, while the long-term volatility correlated fairly well with time.
The long-term climate change correlates with world population, whereas the long natural cyclic climate observations correlate with time, and the combination of both leads to pioneering, innovative visualizations, again within the present interglacial.
Readers interested in these five examples may focus on the descriptive parts and examine the illustrations first. It could be tedious to read everything from start to finish. Therefore, the reader may even take a look at chapters like Literature or Contents. The chapters Attention and Conclusion point, among other things, to the limits of extrapolations.
The chapter on climate change could be boringly long, since after a descriptive section I show three examples of temperature anomalies (air over land, air plus sea, and sea surface), all in the same manner for comparison purposes. Here the reader may glance at the figures and read the concluding comparison section with the last two figures.
The combined curves separated from the temperature anomalies show periods of about 30 to 40 years in which the climate change seems to take a break, a so-called hiatus, followed by a steep increase of nearly the same duration. The first kind of period occurs when the upward trend meets the downward section of a natural cycle. During the second, the long-term trend and the upward part of the wave reinforce each other. Presently we live close to the middle of a hiatus. This phenomenon is controversially discussed in the news. For instance, one may read the false argument that the climate change has stopped in spite of constantly rising atmospheric CO₂. Actually, the underlying long-term trend does not stop, even if it is not directly visible in the present scattered measurements. This can be disclosed, for example, by separating functions as used in chapter 3.3 of this study. 1)
Another subject is the philosophy behind the proposed methods.
Generally, it is often difficult to find a direct cause for a result, especially when the result depends on more than one cause and, moreover, when different causes act in opposite directions. Scientists work hard to trace the result back to these different causes by physical laws. (In fact, this is required for political conclusions.)
Greek philosophers already discussed this problem in early times and found this solution: “The cause of the cause is as well a cause of the result”. We accept this hypothesis in soft science, but not generally in physics. If correlation methods belong to soft science, then we can use this wisdom for our goal. In addition, when we find a common cause for all the other partly conflicting causes, then we can try a correlation of the observations with this common cause. Of course, a good correlation does not replace exact physical laws. Nevertheless, as long as scientists have not yet found such laws, we help ourselves with powerful correlations.
Furthermore, since we collect data and list it with time as the leading category, we consistently find simple trend correlations in publications and news, mostly linear, seemingly with time as the “causing” variable. Nobody cares, it’s just a trend: upward, sideward or downward. Some scientists also like to show their results vaguely or use scenarios in order to impress the readers with their objectivity. Therefore, the most probable possibility is often hidden in a cloud of assumed ones, so that nobody can blame the author if the projections don’t come true.
If I seek good correlations, then I have to look for two things: firstly, I need the real causes or the common cause of the causes, and secondly, I need a feasible model. Only then can I use this correlation for looking into the past and into the future. Nevertheless, I must point out that things may not happen as predicted, but only with a certain probability. It is not necessary to hide the most probable correlation.
One word I should address to teachers and to generations of poor pupils who were forced to correlate world population data as a function of time using the simple logistic function of Verhulst (1838). They either found that the model does not fit the data or they concluded that the model can only describe short periods of time. One of the pupils ended up with a large paper using chaos theory. Scarcely anyone breaks out and tries something new. Maybe the chapter about world population will help to solve this problem.
In this sense, the reader may enjoy this essay, even if not interested in mathematics.
Please note that any model extension beyond the range of measurements has only a conditional probability and cannot be taken as fact, since it is based on correlations, not on laws. The extensions are shown in order to understand the character of the model, not as predictions. Sometimes only the extensions will show which model makes sense and which does not. On top of that, the extrapolations only make sense within periods of constant conditions.
May 2015, Hans-Martin Stoenner
1) New measurements by Sang-Ki Lee et al., Miami (2015), show where the energy could have been stored during the hiatus. From there the energy would probably be released again after some time. That could possibly explain the origin of the cycles of about 65 to 75 years duration separated from the data series in chapter 3.3.
1 Selection of Meaningful Functions and Variables for the Analysis of Data Series
There are many types of time series, such as financial, demographic, economic, climatic, meteorological and environmental series, measurements of atmospheric data, and others.
This paper deals specifically with the correlation of data on world population, long-term measurements of worldwide temperatures, and long-term changes of indices like the Dow Jones Industrial Index. These three time series serve as examples, but the findings can be applied generally.
It is well known that some time series show a more or less perplexing inconsistency or anomaly, while others follow an astonishing regularity. Some of them seem to follow a long-term trend superimposed by short-term variations. Others seem to come and go like waves or other cyclic curves. Finally, some time series seem to be randomly distributed.
The purpose of this essay is to propose a method for separating long-term steady trends from long-term waves and random effects. What one needs first is a good, consistent, well-founded and reasonable function for the steady part.
When this part is separated from the time series, one may find in the residue a wave-like long-term effect, which may in turn be separated using an undulating or sinuous cyclic function. The final residue may be scattered (or randomly distributed), and one will have to find an explanation for this remaining unsteadiness. Possibly its source is unsteady itself, so that the effect can be correlated with that unsteady series, again showing some sort of relationship that excludes the original time scale. Series with or without the time category may generally be called “DATA SERIES”.
Two short examples may explain the possibilities of analyzing extended data series. First, let us take a look at the development of airport passengers per year. The published figures for Frankfurt Airport contain the operating year and the number of passengers (NP) in that year.
Seemingly, the trend for the given time period can be approximated by a polynomial function of time: see Fig. 1-1.
[Figures and tables are omitted from this preview.]
The polynomial trend as a function of time has a coefficient of determination of R^2 = 0.979. However, when extrapolated into the future, this unlimited trend would result in unreasonably high values within a short period. Therefore it is very important to look at the extrapolation of the functions to check whether they are feasible and meaningful.
In order to improve the correlation, we may assume that the trend could have something to do with the growth of the world population “y”. To check this, we add a third category showing the world population figures “y” in those operating years: Fig. 1-1.
Then we try some correlating functions NP = f(y), which may be linear, polynomial or power-law functions. Since the data starts with the opening of the airport, we could for instance use a function like NP(y) = d + b*(f(y))^a. We set f(y) = y - c and d = 0, so that NP = 0 when y = c; c is then the world population at the opening of the airport. The polynomial and the power-law results are shown in Fig. 1-2.
Both the polynomial and the power-law correlation as f(y) are equally good (R^2 ≈ 1); only the extrapolations show a slight difference.
The polynomial trend of the scattered raw data as a function of y (world population) results in R^2 = 0.986, shown in Fig. 1-2.
This is better than that of the polynomial trend as a function of t (time): R^2 = 0.979, shown in Fig. 1-1.
And, even more important, the extrapolated figures are limited and look feasible in the correlation f(y). For instance, the number of passengers in 2100 is 250 million in the case of f(t), but only 100 million in the case of f(y); see Fig. 1-2.
[Figures and tables are omitted from this preview.]
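The power-law correlation NP = b*(y-c)^a described above can be sketched in a few lines. This is only an illustration under stated assumptions: the numbers are synthetic and are not the Frankfurt figures, c (world population at the opening of the airport) is assumed to be known, d is set to 0, and the names fit_power_law and r_squared are made up for this sketch. With c fixed, the function becomes a straight line in log-log space, so an ordinary least-squares line is enough.

```python
import math

def fit_power_law(y, passengers, c):
    """Fit NP = b * (y - c)**a as a straight line in log-log space."""
    xs = [math.log(v - c) for v in y]
    zs = [math.log(v) for v in passengers]
    n = len(xs)
    mx, mz = sum(xs) / n, sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    a = sxz / sxx                 # exponent = slope of the log-log line
    b = math.exp(mz - a * mx)     # prefactor = exp(intercept)
    return a, b

def r_squared(observed, predicted):
    """Coefficient of determination R^2 of a model against observations."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Synthetic illustration only (assumed values, NOT the published airport data).
C_OPEN = 2.5                      # assumed world population (billions) at opening
TRUE_A, TRUE_B = 1.2, 8.0         # assumed "true" exponent and prefactor
population = [3.0, 3.7, 4.4, 5.3, 6.1, 6.9]          # billions of people
wiggle = [1.03, 0.98, 1.02, 0.97, 1.01, 0.99]        # fixed stand-in for scatter
passengers = [TRUE_B * (p - C_OPEN) ** TRUE_A * w
              for p, w in zip(population, wiggle)]   # millions per year

a_fit, b_fit = fit_power_law(population, passengers, C_OPEN)
model = [b_fit * (p - C_OPEN) ** a_fit for p in population]
r2 = r_squared(passengers, model)

# Extrapolation to an assumed future population of 10 billion.
extrapolated = b_fit * (10.0 - C_OPEN) ** a_fit
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}, R^2 = {r2:.4f}")
print(f"NP at y = 10 billion: {extrapolated:.1f} million")
```

Because the extrapolation is driven by y, it stays bounded whenever the world population itself is bounded, which is the point made in the text about feasible extrapolations.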
Remaining fluctuations could be interpreted as passengers having postponed their flight to the following years, or the reverse, so that on average a steady trend results. This average is reflected by the best-fit or compensating curve. The x-axis could now be replaced by the time axis in order to obtain a familiar view. For that we need a good relationship between time (t) and world population (y). The function t = f(y) will be handled in chapter 2, but for now let us have a general look at the category time (t) itself:
Time series are shown with the date or time as the leading category, which some people use as the basis for correlations, mostly for linear trends. But hardly anyone questions whether time (t) is really the cause, reason or source of the observed and recorded data y = f(t). The effects may in some cases correlate with t, like the pendulum of a clock, but the cause of its movement is not time; the causes are gravity and the energy supplied by a spring or a heavy weight. What is time, physically speaking? It is the “fourth dimension”, which may be defined by the regular return of the equinoctial points every half year or by the oscillation of an atomic clock, and could be defined by length and relative velocity. Time may be a result, when one correctly asks: “How long will it take for this or that to happen?” In time series, time is nothing more than a means of listing something in a certain order. So time has to be replaced or complemented in any case when looking for reasons, or even only for correlations of trends. Otherwise the reader could believe that time is the cause. No, the consumed time is the result of an action or of a sequence. Keeping this in mind, one will have the key to detect an erroneous approach or ansatz. But there is no rule without exception: in the case of cyclic functions the waves may correlate well with time, even if time is not the cause. Another exception is the growth of savings accounts due to the compound interest effect.
The separating functions for the analysis of time series therefore have to be selected or developed very carefully.
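As a rough illustration of the separation outlined in this chapter (steady part first, then a cyclic function for the residue, leaving scatter), the following sketch works on synthetic data. Several things here are assumptions and not taken from the text: the steady part is a plain straight line in a generic variable x, used only as a placeholder for a well-founded function; the period of the wave is taken as known; and all names are hypothetical. The two fits are repeated alternately a few times (backfitting), which is an addition of this sketch so that each fit sees the data with the other component removed.

```python
import math

def fit_line(x, v):
    """Ordinary least-squares straight line v = slope * x + intercept."""
    n = len(x)
    mx, mv = sum(x) / n, sum(v) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxv = sum((a - mx) * (b - mv) for a, b in zip(x, v))
    slope = sxv / sxx
    return slope, mv - slope * mx

def fit_wave(x, v, period):
    """Least-squares fit of p*sin(w*x) + q*cos(w*x) for a fixed, assumed period."""
    w = 2.0 * math.pi / period
    s = [math.sin(w * a) for a in x]
    c = [math.cos(w * a) for a in x]
    ss = sum(a * a for a in s)
    cc = sum(a * a for a in c)
    sc = sum(a * b for a, b in zip(s, c))
    sv = sum(a * b for a, b in zip(s, v))
    cv = sum(a * b for a, b in zip(c, v))
    det = ss * cc - sc * sc            # solve the 2x2 normal equations
    return (sv * cc - cv * sc) / det, (cv * ss - sv * sc) / det

def separate(x, data, period, rounds=20):
    """Split data into steady part, cyclic part and remaining scatter."""
    w = 2.0 * math.pi / period
    p = q = 0.0
    for _ in range(rounds):
        wave = [p * math.sin(w * a) + q * math.cos(w * a) for a in x]
        slope, intercept = fit_line(x, [d - k for d, k in zip(data, wave)])
        trend = [slope * a + intercept for a in x]
        p, q = fit_wave(x, [d - t for d, t in zip(data, trend)], period)
    wave = [p * math.sin(w * a) + q * math.cos(w * a) for a in x]
    scatter = [d - t - k for d, t, k in zip(data, trend, wave)]
    return slope, intercept, math.hypot(p, q), scatter

# Synthetic series (assumed values): trend + 70-unit cycle + small fixed scatter.
x = list(range(140))
noise = [0.02 * (((i * 37) % 11) - 5) / 5 for i in x]
data = [0.005 * a - 0.3 + 0.1 * math.sin(2 * math.pi * a / 70) + e
        for a, e in zip(x, noise)]

slope, intercept, amplitude, scatter = separate(x, data, period=70)
print(f"slope = {slope:.5f}, wave amplitude = {amplitude:.3f}")
```

With a single round the procedure is the plain two-step separation; the wave then leaks into the fitted trend and distorts both parts, which is one reason the choice of the steady function matters so much.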
2 Examples for Separating Functions
2.1 Growth Functions
Let us start with a long-term, more or less steady time series such as the human world population. The earth is a closed system for human beings, small exceptions neglected. But it is open to the input of solar radiation energy and to the loss of energy to space. This energy flow creates life on earth, drives the wind and evaporates water, forming clouds and rain. The O2-CO2 cycle between the animal world and the flora is nowadays a relatively steady process as a basis for life, driven by solar energy. This paradise is home to the human world population. But human beings differ in nature from other beings, which live more in equilibrium with the environment. This nature allows the human species to overcome limits, until all unknown limits are exhausted or the closed system is filled up to the maximum the earth can still afford.
What could a sigmoid or logistic formula look like that describes the growth and the limit of the world population? The famous Russian physicist S. Kapitza (English: Kapitsa), member of the Club of Rome, has made a proposal to describe the growth per year of the world population (N, people). Kapitza’s formula reads, in his own nomenclature:
dN/dt = C / ((T-t)^2+τ^2)
containing the parameters C, T and τ. This is analogous to the mathematical ansatz
dy/dx = 1/((1-x)^2+1^2)
which integrates to y = -ARCTAN(1-x) + constant. If one plots dy/dx over x, one gets a bell-shaped curve, which is somewhat similar to the growth function of the world population, dN/dt. Substituting (1-x) = (T-t)/τ, so that dx = dt/τ, and scaling y by C/τ to obtain N, we find Kapitza’s ansatz dN/dt = C/((T-t)^2+τ^2), which integrates to
N = No-C/τ*ARCTAN((T-t)/τ)
with the four parameters No, C, T and τ. This approach is astonishingly good and is capable of describing the world population curve over a few centuries. But it requires a very high starting point of about 400 million people in order to reflect the recent world population figures between 1804 and 2011 (or between 1 and 7 billion people). What is the error of this approach? Remember what was said in chapter 1: “No, the consumed time is the result of an action or of a sequence. Keeping this in mind, one will have the key to detect an erroneous approach or ansatz.” Kapitza’s approach reads, in simplified form (with ý denoting the growth rate dy/dt),
ý = f (t).
The reason for the growth seems to be time or date. But our calendar has nothing to do with the cause of the growth. We can only ask, “How long will it take for the population to grow from 1 to 7 billion?” Therefore this approach is erroneous in principle.
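For reference, the arctangent solution of Kapitza’s ansatz quoted above can be checked numerically. The sketch below is only an illustration: the parameter values are placeholders chosen for this example, not Kapitza’s fitted values, and the “starting point” of 400 million is interpreted here, as an assumption, as the lower asymptote of the curve for t far in the past.

```python
import math

# Placeholder parameters (assumptions for illustration, not fitted values).
C = 1.8e11       # people * years
T = 2000.0       # year of fastest growth
TAU = 45.0       # years
N_LOW = 0.4e9    # assumed lower asymptote ("starting point"), people
N0 = N_LOW + (C / TAU) * math.pi / 2   # constant of integration No

def growth_rate(t):
    """Kapitza's ansatz dN/dt = C / ((T - t)^2 + tau^2)."""
    return C / ((T - t) ** 2 + TAU ** 2)

def population(t):
    """Integrated form N = No - (C/tau) * arctan((T - t)/tau)."""
    return N0 - (C / TAU) * math.atan((T - t) / TAU)

def numeric_rate(t, h=0.01):
    """Central-difference derivative of population(t)."""
    return (population(t + h) - population(t - h)) / (2 * h)

for year in (1800, 1950, 2000, 2050):
    exact, approx = growth_rate(year), numeric_rate(year)
    print(year, f"{exact:.4e}", f"{approx:.4e}")

# The curve is bounded on both sides: the arctangent saturates at -pi/2 and +pi/2.
upper_limit = N0 + (C / TAU) * math.pi / 2
print(f"upper limit: {upper_limit / 1e9:.2f} billion")
```

The check confirms that the integrated form and the rate equation agree; it says nothing about whether time is a meaningful driving variable, which is the objection raised in the text.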
What do growth-functions really look like?
ý = f(y)
The simplest growth function is ý = y. This means: something must exist (y) before it can grow (ý). The growth ý is proportional to the growing subject y. The higher this y is, the faster it grows (ý).
For integration we separate the variables: dy/y = dx and with dx = k*dt we obtain dy/y = k*dt. This integrates to:
t = to + (1/k)*LN(y/yo)
In this form “t” is the result. But most people prefer the form solved for y (with the logarithm removed), which in this simple case can still easily be obtained:
y = yo*EXP (k*(t-to))
Mathematically, this exponential form means the same as before; only the order of calculation has been changed. But now some people believe that time is the driving force for this exponential growth. This leads to worldwide confusion and to political demands that the leaders should stop the unlimited exponential growth. (Nobody says how that should be done. Actually, it is impossible.)
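That the two forms carry the same information can be seen in a short sketch. The assumptions are stated plainly: a single constant k is used purely for illustration (the text does not claim that the world population followed one exponential), k is derived from the two figures quoted earlier (1 billion in 1804, 7 billion in 2011), and the function names are hypothetical.

```python
import math

T0, Y0 = 1804.0, 1.0e9            # reference year and population from the text
K = math.log(7.0) / (2011 - 1804) # constant rate implied by 1 -> 7 billion

def time_to_reach(y):
    """Time as the result: t = to + (1/k) * ln(y / yo)."""
    return T0 + math.log(y / Y0) / K

def population_at(t):
    """The same relation solved for y: y = yo * exp(k * (t - to))."""
    return Y0 * math.exp(K * (t - T0))

year_7bn = time_to_reach(7.0e9)          # "How long will it take ...?"
round_trip = population_at(time_to_reach(3.0e9))

print(f"k = {K:.5f} per year")
print(f"7 billion reached in {year_7bn:.1f}")
print(f"round trip for 3 billion: {round_trip:.4e}")
```

Both functions are inverses of each other, so neither contains more causal information than the other; only the question being asked differs.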