Time-series similarity measures

  • Suppose I have two time series $X$ and $Y$ of stock prices. How do I measure the "similarity" of $X$ and $Y$?

    (I'm being deliberately vague as I don't have a particular application, and I'm curious about different approaches in general. But I guess you can imagine that there's some stock x that I don't want to trade directly, for whatever reason, so I want to find a similar stock y to trade in its place.)

    One method is to take a Pearson or Spearman correlation. To avoid problems of spurious correlation (since the price series likely contain trends), I should take these correlations on the differenced or returns series (which should be more stationary).
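    The spurious-correlation point can be illustrated with a small simulation. The sketch below is a minimal pure-Python example and is not part of the original question: the helper names (`log_returns`, `pearson`, `random_walk_prices`) and the simulation parameters are assumptions chosen for illustration. It generates two independent random-walk price paths and compares the correlation of the levels with the correlation of the log returns.

```python
import math
import random


def log_returns(prices):
    """Log returns r_t = ln(P_t / P_{t-1}); the result has len(prices) - 1 entries."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_walk_prices(n, rng, start=100.0, vol=0.01):
    """Simulated price path with i.i.d. Gaussian log returns (illustrative only)."""
    prices = [start]
    for _ in range(n):
        prices.append(prices[-1] * math.exp(rng.gauss(0.0, vol)))
    return prices


rng = random.Random(0)
x = random_walk_prices(5000, rng)
y = random_walk_prices(5000, rng)  # generated independently of x

print("correlation of price levels:", pearson(x, y))
print("correlation of log returns: ", pearson(log_returns(x), log_returns(y)))
```

    Because the two paths are independent by construction, the returns correlation should come out close to zero, whereas the level correlation can be far from zero simply because both series wander.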

    What are other similarity methods and their pros/cons?

    If one of the answers was helpful, please accept it. Thank you!

  • One of my favorites is a generalization of correlation: Distance Correlation (dCor)

    There are several reasons for that:

    1. It generalizes classical (i.e. linear) correlation in the sense that linear dependence is a special case: an exact linear relationship gives a reading of 1, matching the absolute value of Pearson's correlation.
    2. There are analogs for variance, covariance and standard deviation, so these identities hold: $$\operatorname{dVar}^2_n(X) := \operatorname{dCov}^2_n(X,X)$$ and $$\operatorname{dCor}(X,Y) = \frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dVar}(X)\,\operatorname{dVar}(Y)}}$$
    3. $\operatorname{dCor}=0$ holds if and only if the variables are truly independent; any other reading implies linear or non-linear dependence. Compare the following readings, first linear correlation (source):

    [Figure: linear correlation readings; image not included]

    and distance correlation (source):

    [Figure: distance correlation readings; image not included]

    Beware, oversimplification ahead: it behaves this way because it is built on the characteristic functions of the random variables, i.e. the Fourier transforms of their probability density functions. Roughly, it measures how far the joint characteristic function is from the product of the two marginal ones. It therefore tests not only for linear dependence but for essentially any functional dependence that can be represented through the (periodic) complex exponential function. For more intuition, see also this article: Here.

    There is an implementation in R.
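    For readers who want to see the mechanics, the sample distance correlation can be sketched in a few lines of plain Python. This is a minimal illustration under the usual sample definition (double-centered pairwise distance matrices), not the R implementation mentioned above; the function names are assumptions made for this example, and the O(n²) memory use makes it suitable only for short series.

```python
import math


def _centered_distances(x):
    """Pairwise |x_j - x_k| matrix with row, column and grand means removed."""
    n = len(x)
    d = [[abs(a - b) for b in x] for a in x]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so the column means equal the row means.
    return [
        [d[j][k] - row_means[j] - row_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length univariate sequences."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2
    dvar2_x = sum(v * v for row in a for v in row) / n**2
    dvar2_y = sum(v * v for row in b for v in row) / n**2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # at least one series is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(distance_correlation(x, [2 * v + 1 for v in x]))  # exact linear relationship
print(distance_correlation(x, [v * v for v in x]))      # symmetric parabola
```

    In the second call Pearson's correlation is exactly zero because the parabola is symmetric around the mean of `x`, yet the distance correlation is clearly positive, which is the behaviour described in point 3.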

    Do you know any good reference where this concept is applied to financial data? Furthermore the "neustats.com" link does not work (at least for me). Thanks!

    @Richard: No, I don't have any examples right now, but I plan to use it on financial data myself. The article seems to have disappeared indeed. It is archived here: http://archive.is/7TVzZ

    Thanks for the link - in case you publish anything about your work on dCor I'd be happy to read more.

    @vonjd Thank you for bringing this to our attention! Distance correlation appears to be a wonderful statistical device. I have incorporated it into my weekly scan for related markets.

  • I assume you're using returns (or log returns) instead of actual stock prices. In practice, you may also want to smooth the data by using a moving average.

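    There are several correlation coefficients:

    • Pearson's $r$ - the covariance of the two data sets divided by the product of their standard deviations; it measures linear dependence:

    \begin{equation} r = \frac{\sigma_{x,y}}{\sigma_x \sigma_y} \end{equation}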

    • Spearman's $\rho$ - uses the rank of each observation within its data set (its array index if the data had been sorted); being non-parametric, it is less sensitive to outliers in the sample. With $x_i$ and $y_i$ denoting the ranks, and assuming no tied values:

    \begin{equation} d_i = x_i - y_i \end{equation}

    \begin{equation} \rho = 1 - \frac{6\sum d_{i}^{2}}{n(n^{2}-1)} \end{equation}

    • Kendall's $\tau$ - also based on ranking, but represents the probability that the two data sets are in the same order minus the probability that they are in different orders:

    \begin{equation} \tau = \frac{C - D}{\frac{1}{2} n(n - 1)} \end{equation}

    A closely related measure, Goodman and Kruskal's $\Gamma$, divides by the number of untied pairs instead of all pairs:

    \begin{equation} \Gamma = \frac{C - D}{C + D} \end{equation}

    $C$ is the number of concordant pairs and $D$ is the number of discordant pairs.
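    The coefficients above can be computed directly from their definitions. The following is a minimal pure-Python sketch; the function names are assumptions made for this example, Spearman's $\rho$ is computed as Pearson's $r$ on average ranks (which reduces to the $d_i$ formula when there are no ties), and the pair counting is the simple O(n²) version.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def concordance_counts(x, y):
    """Return (C, D): numbers of concordant and discordant pairs; ties count as neither."""
    c = d = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return c, d


def kendall_tau(x, y):
    c, d = concordance_counts(x, y)
    n = len(x)
    return (c - d) / (0.5 * n * (n - 1))


def goodman_kruskal_gamma(x, y):
    """Undefined (division by zero) if every pair is tied."""
    c, d = concordance_counts(x, y)
    return (c - d) / (c + d)


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson(x, y), spearman(x, y), kendall_tau(x, y), goodman_kruskal_gamma(x, y))
```

    For this toy input there are 8 concordant and 2 discordant pairs, so $\tau$ and $\Gamma$ coincide; they differ only when some pairs are tied.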

  • You can look at cointegration.
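    As a rough illustration of what a cointegration check involves, here is a minimal pure-Python sketch of the two-step Engle-Granger idea: regress one series on the other, then test the residual spread for mean reversion with a Dickey-Fuller-type regression (no constant, no lagged differences). The function names and the simulated data are assumptions made for this example. The resulting statistic does not follow a standard t distribution, so judging significance requires the appropriate Engle-Granger critical values; a library routine such as `statsmodels.tsa.stattools.coint` takes care of that.

```python
import math
import random


def ols(x, y):
    """Least-squares fit y = alpha + beta * x; returns (alpha, beta)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - beta * mx, beta


def dickey_fuller_t(e):
    """t-statistic of rho in  e_t - e_{t-1} = rho * e_{t-1} + u_t."""
    lagged = e[:-1]
    diffs = [b - a for a, b in zip(e, e[1:])]
    sxx = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, diffs)) / sxx
    residuals = [d - rho * l for l, d in zip(lagged, diffs)]
    s2 = sum(u * u for u in residuals) / (len(diffs) - 1)
    return rho / math.sqrt(s2 / sxx)


def engle_granger_stat(x, y):
    """Return (hedge ratio beta, Dickey-Fuller statistic of the regression residuals)."""
    alpha, beta = ols(x, y)
    spread = [b - alpha - beta * a for a, b in zip(x, y)]
    return beta, dickey_fuller_t(spread)


# Simulated example: x is a random walk, y tracks 2 * x up to stationary noise.
rng = random.Random(1)
x = [0.0]
for _ in range(999):
    x.append(x[-1] + rng.gauss(0.0, 1.0))
y = [0.5 + 2.0 * v + rng.gauss(0.0, 1.0) for v in x]

beta, stat = engle_granger_stat(x, y)
print("hedge ratio:", beta, "statistic:", stat)
```

    A strongly negative statistic points to a mean-reverting spread, which is the sense in which cointegration says something different from correlation: it concerns the levels staying together rather than the returns moving together.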

    I suppose this would only work for long-term investments.

    So do I, but I'm not completely sure of that. I would like to hear whether people think this could work for short-term investment if cointegration is found in high-frequency data.

    I once did something along these lines with open interest and futures, but only up to the step of saying that open interest and futures prices are cointegrated, and that some barriers of open interest might be useful as a trading signal. In this regard I'd say that if cointegration is present, one does not really have a similarity measure but an indicator of a working mean-reversion relationship.

    @owe jessen - curious what you mean when you say futures and OI are cointegrated since although related, one measures price and the other measures quantity.

    That takes me back some time - I took the levels of the SP500, Volume SP 500 and OI of SP500-Futures, tested for cointegration and couldn't dismiss two relationships. Ditto for the time-dependent (GARCH) volatility of the SP500. I'm not sure why the different units should pose a problem.

    @Owe Jessen - Did you adjust for the recurring rise and fall of open interest around rollover dates? And yes, I guess there's no reason the different units should cast any doubt. Just wondering if this is a tradable relationship... how much of the relative movement is from volume & open interest and how much from price. This paper claims that open interest is highly predictive for currency, commodity, bond, & stock index futures, so it's very interesting to hear that you've found cointegration there.

    No, there was no correction of this kind. I didn't develop a trading simulation, so I can't say if it is really tradeable, but I also found a significant (negative) influence of first differences of open interest on the level of the SP500 when used as an additional variable in the mean equation of a GARCH model. Sadly I can't publish the results in detail because this was proprietary research. I generally used the method of Floros, Christos (2007): Price and Open Interest in Greek Stock Index Futures Market. In: Journal of Emerging Market Finance, p. 191–202.

    Also, cointelation.

  • If you use a Box-Jenkins model, look at this research which uses an ARIMA framework to define clusters, and then measures the similarity of the time series via a cepstral coefficient based upon the autoregressive parameters.


  • You can use wavelet coherence, which is a measure of frequency-varying and time-varying similarity of two time series $X_t$ and $Y_t$ by comparing the coefficients of the wavelet transform $\int_{-\infty}^{\infty} f(t) \psi_{u,s}(t)dt$ (in highly non-technical terms). You can use the phase difference to study the lead-lag relationship.

    The benefits would be:

    • Doesn't require stationarity in $f(t)$.
    • Detects frequency-dependent comovement. If speculative behavior drives mid-high frequency returns and investment behavior drives low frequency returns then there is no reason why comovement at one frequency necessitates comovement at another frequency.
    • You can couple the analysis with an analysis of phase difference between the two wavelet transforms, frequency-dependent Granger causality and other tools.

    Another method would be copula correlations and copula conditional probabilities.

    • These can be time-varying through the Patton (2006) parameterizations and more recent SCAR versions, and I don't think the assumptions on the DGP are very strict.
    • You can study $\mathbb{P}(X < q | Y < q)$ which is a measure of comovement that is not offered by standard correlation measures.
    • You can get time-varying tail correlations if you are concerned with tail events.
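    The conditional probability in the second bullet can be estimated non-parametrically before fitting any copula family. The sketch below is a minimal pure-Python illustration under one stated assumption: $q$ is read as a common quantile level applied to each series' own empirical distribution, i.e. the quantity estimated is $\mathbb{P}(U_X < q \mid U_Y < q)$ for rank-transformed data. The function names are hypothetical.

```python
def pseudo_observations(values):
    """Rank-transform data into (0, 1) via rank / (n + 1); ties are broken by position."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def lower_tail_conditional_prob(x, y, q):
    """Empirical P(U_x < q | U_y < q) for a quantile level q in (0, 1)."""
    ux, uy = pseudo_observations(x), pseudo_observations(y)
    ux_given_low_y = [a for a, b in zip(ux, uy) if b < q]
    if not ux_given_low_y:
        raise ValueError("no observations of y fall below the q-quantile")
    return sum(1 for a in ux_given_low_y if a < q) / len(ux_given_low_y)


x = list(range(100))
print(lower_tail_conditional_prob(x, x, 0.1))        # identical series move together
print(lower_tail_conditional_prob(x, x[::-1], 0.1))  # perfectly opposed series
```

    For independent series the estimate should be close to $q$ itself, so values well above $q$ at small $q$ indicate joint downside moves that an overall correlation figure may not reveal.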

    Also, quantile correlations:

    • Still a working paper.
    • Correlation of one variable and another, conditional on the other being in an extreme region of the distribution.

    There are also extreme correlation measures within generalised Pareto distributions. ...Not sure how this works though.

  • Start with simple things like Pearson correlation, e.g.

    [Two figures illustrating Pearson correlation; images not included]

  • You may consider cross correlations of logarithmic returns on the two assets' prices, $\text{Corr}(x_t,y_{t-h})$, where $x_t$ and $y_t$ are the returns series and $h$ is the time lag. Cross correlation is the same as ordinary correlation but one series of returns is lagged by $h$ periods with respect to the other series. However, its use may be limited as information dissemination in liquid markets is quick and prices react to shocks without considerable delays.

  • You could use wavelet cross-correlation and phase/coherence analysis between the two series. By analyzing the series at multiple frequencies you can assess whether there is causality (one causing the other, even if not directly, and thus whether one can be used to predict the other). Cross-correlation will indicate at which frequencies the two series are related.

License under CC-BY-SA with attribution

Content dated before 7/24/2021 11:53 AM