What is the best data structure/implementation for representing a time series?

  • I was wondering what is best practice for representing elements in a time series, especially with large amounts of data. The focus/context is in a back testing engine and comparing multiple series.

    It seems there are two options:

    1) Using an integer index, or
    2) Using a date-based index

    At the moment I am using dates, but this impacts on performance & memory usage in that I am using a hash table rather than an array, and it requires some overhead in iteration (either forward or backwards) as I have to determine the next/previous valid date before I can access it.

    However, it does let me aggregate data on the fly (e.g. building the ohlc for the previous week when looking at daily bars) and most importantly for me allows me to compare different series with certainty I am looking at the same date/time. If I am looking at an equity issue relative to a broader index, and say the broader index is missing a few bars for whatever reason, using an integer indexed array would mean I'm looking at future data for the broad index vs present data for the given security. I don't see how you could handle these situations unless you're using date/times.

    Using integer indexes would be a lot easier code wise, so I was just wondering what others are doing or if there is best practice with this.

    What programming language are you using?

    I just do what R does with `POSIXct`: fractional seconds since the epoch. Millisecond resolution ... and it easily interfaces with POSIX time semantices in other systems.

    R also have class libraries such as XTS which you can use. They also allow you to select subsample quite easily. But, as @chrisaycock said, it pretty much depends on what technology you're using, and how large is your sample.

    Just use pandas. Or if you're intent on going your own route, copy pandas. At its most basic it's just parallel arrays of timestamps and data but there're a lot of edge cases. http://pandas.pydata.org/pandas-docs/dev/timeseries.html

  • wburzyns

    wburzyns Correct answer

    10 years ago

    Representing time series (esp. tick data) using elaborate data structures may be not the best idea.

    You may want to try to use two arrays of the same length to store your time series. The first array stores values (e.g. price) and the second array stores time. Note that the second series is monotonically increasing (or at least non-decreasing), i.e. it's sorted. This property enables you to search it using the binary search algorithm. Once you get an index of a time of interest in the second array you also have the index of the relevant entry in the first array. If you wrap the two arrays and the search algorithm e.g. in a class you will have the whole implementation complexity hidden behind a simple interface.

    Arrays are also cache friendly which is a big advantage on modern CPU where a lot of cache misses can be critical for performance.

    Additionally, if you have some data in mostly regular intervals (e.g. daily) you can use estimates of the actual position (knowing start/end dates and frequency) to further beef up performance for even more juice.

    I disagree with this solution; it produces human errors. what if you miss one entry on one of the arrays? all your data is out of wack. Best is to store the Date,OHLC in rows; and then in one array.

    I second alpha's comment, the last think you want to do is have a massive vector once you get any kind of data in your timeseries.

    @alpha This is the correct datastructure, although I agree with you that one needs to put either a massive amount of testing into it or steal someone else's implementation. Pandas's series are essentially what's described.

License under CC-BY-SA with attribution

Content dated before 7/24/2021 11:53 AM