Is there anything out there as a substitute for KDB?

  • Thanks a lot for your discussion of the original post.

    Following your suggestions, let me rephrase a bit:

    kdb+ is known for its efficiency, but that efficiency comes at a steep price. With computational power so cheap these days, there must be a sweet spot where we can achieve comparable efficiency of data manipulation at a more reasonable cost.

    For instance, if a KDB license costs a small shop $200K per year (I don't know how much it actually costs, do you know?), maybe there is a substitute solution: e.g., we pay $50K to build a decent cluster, store all the data on a network file system, and parallelize all the data queries. This solution might not be as fast or as elegant as KDB, but it would be much more affordable and, most importantly, you would take full control of it.

    What do you think? Is there anything like this?

    I'm not sure what you mean. The main trick used by KDB is to store the data in columns instead of rows. This has the advantage that if one column is selected, all the data can be read in one long read. This can also be done in Python.
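    A minimal sketch of that trick in Python with NumPy (file name and sizes hypothetical): each field lives in its own contiguous array, so selecting a column is one long sequential read.

```python
import os
import tempfile

import numpy as np

n = 100_000
# Column-oriented layout: each field is its own contiguous array
# (and, on disk, its own file), instead of interleaving fields row by row.
price = np.full(n, 100.0)

path = os.path.join(tempfile.mkdtemp(), "price.bin")
price.tofile(path)                    # one contiguous write per column
col = np.fromfile(path, dtype="f8")   # selecting the column = one long read
total = col.sum()                     # touches only this column's bytes
```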

    1. Storing each tick as a separate record/row is neither sensible nor feasible (as you already mentioned). The most common approach is to split tick data by day and store each day of data as plain arrays, either in files or in database LOBs. This approach makes the amount of data perfectly manageable.
    2. Possible duplicate of "What is the best data structure/implementation for representing a time series?"
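    The day-splitting idea in point 1 can be sketched in Python with NumPy, assuming a hypothetical one-file-per-symbol-per-day layout:

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical fixed record layout for one tick.
tick = np.dtype([("time", "i8"), ("price", "f8"), ("size", "i4")])

def day_path(root, symbol, day):
    # One plain-array file per symbol per day, e.g. root/AAPL/2014-04-01.bin
    return Path(root) / symbol / f"{day}.bin"

def write_day(root, symbol, day, ticks):
    p = day_path(root, symbol, day)
    p.parent.mkdir(parents=True, exist_ok=True)
    np.array(ticks, dtype=tick).tofile(p)   # the whole day as one flat array

def read_day(root, symbol, day):
    return np.fromfile(day_path(root, symbol, day), dtype=tick)

root = tempfile.mkdtemp()
write_day(root, "AAPL", "2014-04-01", [(1, 100.0, 10), (2, 100.5, 20)])
day = read_day(root, "AAPL", "2014-04-01")
```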

    The minimal 2-core setup of kdb+ is actually pretty reasonable: $25K per year including maintenance.

    Arctic could be an option (it's free software, built on top of MongoDB), but I haven't used it, so I cannot give you an opinion on it:

  • At discretelogics we just released a file format to store time series in flat files called "TeaFiles". In addition to raw data they can store the binary item layout and a description of the contents.

    C#, C++, Python APIs are available open source, licensed under the GPL, see

    Using memory mapping, read performance reaches that of in-memory array processing for sequential usage of a file, as is the case for back testing.
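    That access pattern can be sketched in Python, with numpy.memmap standing in for the memory-mapped TeaFile (file name and contents hypothetical):

```python
import os
import tempfile

import numpy as np

# Write a flat file of doubles, then memory-map it read-only.
path = os.path.join(tempfile.mkdtemp(), "series.bin")
np.arange(1_000_000, dtype="f8").tofile(path)

mm = np.memmap(path, dtype="f8", mode="r")
# A sequential pass over the mapped file (e.g. one backtest run) proceeds at
# close to in-memory speed, since the OS pages the file in ahead of the scan.
total = mm.sum()
```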

    The C# API at CodePlex includes micro-benchmarks. Summing up a file holding a single 8-byte double per item reaches 500 million operations per second on an older test machine; using a tick item with int64/double/int for time/price/volume reaches 100 million operations per second.

    Disclosure: self promotion + interest in discussing time series persistence

    +1 I like the idea. How do you handle cross-sectional time series? Will we be able to embed your visualization API into a custom app?

    Thank you! By cross-sectional time series you mean the analysis of several series, like computing an index or stat arb, right? About embedding: our code base would allow embeddable components; we have already thought about offering that, and I take your question as valuable input for our product road map. At the moment our visualization tools TeaHouse/TeaShell are Windows/XAML; our persistence layer is cross-OS/app, and we plan visualization on an HTML5 architecture for the future. Our decisions will be customer driven, so your vote counts. We are thankful for any feedback and insight into your needs.

    Are there any benchmarks, say of this vs. HDFS or other formats?

    Benchmarks of raw time-series data access using TeaFiles are included in the C#/C++ sources. A benchmark against HDFS would first need to specify what to measure: TeaFiles perform 1:1 with raw data structs in files, as they are nothing else, plus they allow memory mapping, which is hard to beat at first; read-ahead and prefetching allow going beyond that in some situations. HDFS would be a layer above raw file storage, and we have it on our roadmap to take simple file persistence to the next level, for management rather than performance reasons, though.

    A little late to the party here but the link in the answer is now dead.

    link fixed. thx for the heads up!

    Awesome project! Thank you for making it open source.

  • As of April 2014, the 32-bit version of kdb+ is now free to try.

    This free version may not be used in production systems.

    The only technical limitation vs. the 64-bit version is that you can only address up to 4GB of memory per process.

    I was not aware of that, thanks a lot for this update, +1

    It's not free as in free beer. You are restricted to using it for non-commercial purposes only; it's basically a trial. Your answer makes it sound like it's open source, which is not true.

    Not only is it not free, but as of 2019 it is time limited. From the licence: "1.4 Disabling Features. End User understands that the Kdb+ On Demand Software contains a feature which will automatically cause the Kdb+ On Demand Software to time-out six (6) months from the date of installation of the Kdb+ On Demand Software or such other date which Kx at its discretion identifies. Kx may at its discretion from time to time agree to extend use of the Kdb+ On Demand Software for a further six (6) months or such other period which Kx at its discretion identifies."

    Kdb+ On Demand is different from the 32-bit version of kdb+. On Demand provides the full 64-bit version but requires an internet connection and a license that will time out.

  • You could look into Pandas, a Python library that integrates with PyTables. It was created by someone at AQR and has some features similar to KDB's.
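    For example, pandas offers kdb+-style time-series operations out of the box, such as an as-of join (the counterpart of kdb+'s aj); a small sketch with made-up data:

```python
import pandas as pd

trades = pd.DataFrame({
    "time": pd.to_datetime(["09:30:01", "09:30:05", "09:30:09"]),
    "price": [100.0, 100.5, 101.0],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["09:30:00", "09:30:04", "09:30:08"]),
    "bid": [99.9, 100.4, 100.9],
})
# As-of join: each trade picks up the most recent quote at or before its
# timestamp -- the same idea as kdb+'s aj.
joined = pd.merge_asof(trades, quotes, on="time")
```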

  • I don't like KDB+/q. To the KDB+ experts: I am not picking a fight. The following is just my own understanding of KDB+ and time-series databases. You're warmly welcome to correct me if anything looks wrong to you. :)

    First of all, during my nearly one year of KDB+/q development experience, I have never found a published benchmark indicating that KDB+/q significantly outperforms other storage systems in speed (and by storage system I don't limit the scope to disk-based RDBMSs). So I don't know why everybody in quant finance talks about KDB+ and is impressed by its efficiency without data backing that point.

    Second, I once talked with an Oracle Certified DBA about KDB+. From my description, the first characterization he came up with was an in-memory cache, not even a full-fledged database. Maybe he is right to some extent, isn't he?

    Next is the so-called "time series". More than one manager advocating KDB+, from different positions and companies, has told me that KDB+ is a time-series database. However, I don't think time series are a feature of KDB+; its real feature is the column-oriented storage engine, which happens to make it friendly for storing time-series data. And these days, column-oriented storage engines are no longer exclusive to KDB+: for traditional RDBMSs like MySQL, you can also find corresponding column storage engines in the open-source communities.

    Last and most important, I want to talk about developer friendliness. KDB+ ships with a DSL, q, which is very developer unfriendly. When an error is encountered, it just raises the error type, without any line information to anchor on, which makes bug fixing and code maintenance harder.

    Alternatives to KDB+ I can come up with:

    1. For column-oriented storage, a traditional K-V store works, or even a column-storage-backed RDBMS;
    2. for the in-memory feature, there are many open-source in-memory caches, like Redis, memcached, etc.

    The key feature of the alternative, I think, is that the API is written in standard C/C++, so you have a lot of utilities to ease your development.
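    The column-oriented-storage-on-a-K-V-store alternative from point 1 can be sketched with a plain dict standing in for Redis/memcached (the key scheme and helper names are hypothetical):

```python
import struct

store = {}  # in-memory stand-in for a K-V store such as Redis or memcached

def put_column(symbol, day, field, values):
    # One key per (symbol, day, column); values packed as one binary blob.
    store[f"{symbol}:{day}:{field}"] = struct.pack(f"{len(values)}d", *values)

def get_column(symbol, day, field):
    blob = store[f"{symbol}:{day}:{field}"]
    return list(struct.unpack(f"{len(blob) // 8}d", blob))

put_column("AAPL", "2014-04-01", "price", [100.0, 100.5, 101.0])
prices = get_column("AAPL", "2014-04-01", "price")
```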

    While not definitive as far as speed goes, it certainly helps that the entire executable is roughly 400 KB, depending on the version, meaning it can live entirely in CPU cache.

    @afekz Thank you for your comments. But if not for speed, why would people bother with the cache? :p Yes, you're right, this may be one of the pros of KDB.

    One reason KDB is so friendly to Time Series data is that queries are implicitly stable in terms of row ordering. Many other solutions listed don't offer this guarantee.
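    A toy illustration of why that stability matters, using Python's stable sort on made-up ticks: grouping by symbol preserves each symbol's original time order without an explicit secondary key.

```python
ticks = [("09:30:00", "AAPL", 100.0),
         ("09:30:01", "MSFT", 50.0),
         ("09:30:02", "AAPL", 100.5)]

# Python's sort is stable, so within each symbol the ticks stay in their
# original (time) order even though we sorted only on the symbol.
by_symbol = sorted(ticks, key=lambda t: t[1])
```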

    @Chuu Thank you for your comment. By "other solutions", may I know your alternatives? :)

    @Summer_More_More_Tea I could write an entire article on this. I will say: (1) there is no substitute for KDB if you use it to its full potential; (2) most users do not use, and do not need to use, KDB to its full potential; (3) there is no FOSS solution that comes close, but with the data volumes most individuals handle, many FOSS systems work very well; (4) there are other commercial solutions eating into KDB's market share, but I don't have enough hands-on experience to make a recommendation. eXtremeDB is one that people I trust have strongly considered.

  • KDB is useful for two reasons: storage of data with easy access to it (i.e. querying ticks, etc.), and a rich query language that supports many quant functions.

    However, what KDB does not do well is the quant query language.

    I have evaluated KDB, Matlab, and R. So far R is the winner.

    I have not found any faster solution for storing and retrieving data than flat binary files, divided by month for ease of access. My app can read 1 million ticks in 3 seconds and backtest accordingly.

    For retail traders, I suggest you use flat files (binary instead of text, for quick read/write). The MT4 data structure is a good example to follow.
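    A sketch of that flat-file approach, assuming a hypothetical fixed-width binary record per tick and one file per symbol per month (the field layout below is made up, only loosely inspired by MT4-style history files):

```python
import os
import struct
import tempfile

# Hypothetical fixed-width tick record: epoch time, bid, ask, volume.
RECORD = struct.Struct("<qddi")

def append_ticks(path, ticks):
    # e.g. one file per symbol per month: EURUSD-2014-01.bin
    with open(path, "ab") as f:
        for t in ticks:
            f.write(RECORD.pack(*t))

def read_ticks(path):
    # Whole-month reads are one sequential pass over fixed-width records.
    with open(path, "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, i) for i in range(0, len(data), RECORD.size)]

path = os.path.join(tempfile.mkdtemp(), "EURUSD-2014-01.bin")
append_ticks(path, [(1, 1.3700, 1.3701, 100), (2, 1.3702, 1.3703, 50)])
ticks = read_ticks(path)
```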

    It is cheap, free, and fast!!!

    How do you compare a database system and two scientific-oriented programming languages?

    can you rephrase your question?

    KDB is a database. Matlab and R are programming languages/environments. Q is KDB's programming language and would be somewhat comparable strictly as a language to J.

    Unless you're intending to leverage Q though, there is no point to investing in KDB. As a pure tick warehouse it is inferior to other column stores.

    @Chuu, do you have any data to back up your claim?

    @Datageek No, however I seriously doubt many KDB users and administrators would disagree with the statement. Column stores are a dime a dozen these days and many have superior feature sets to KDB if you just want to get your data back out exactly as you put it in.

    @Chuu I have only started looking into kdb+, but so far I can see that its performance is much better than the column datastores that I know of. Interesting fact: kdb+ is a column-based database as well. See:

    @Datageek Most KDB users are intimately aware that KDB+ is a column store because administrating it requires in-depth knowledge of the on-disk structure of the database. Also, keep in mind that KDB is an in-memory database, and the insert performance numbers cited do not include persisting to disk.

    Try MongoDB. For many cases it can be as fast as kdb.

  • Have a look at Kona, which is a FOSS project trying to be compatible. Also, Tom Szczesny has done some work on its predecessors, namely A. I hope this helps.

    Also, if you are not looking for a perfect substitute, you can have a look at other time-series databases like InfluxDB, Java Chronicle, OpenTSDB, and KairosDB, which are all open source. There are commercial ones as well, of which OneTick is targeted at tick-data management.

  • We've created a roundup of the top column-oriented database systems:

    This includes kdb+ and some open source alternatives.

    Open source: InfluxDB, Java Chronicle, OpenTSDB, KairosDB

    Closed: OneTick, McObject, Teradata Database, Vectorwise, Sybase, Vertica

    We have done some initial work benchmarking common time-series queries and found the open-source MonetDB to be particularly fast. We hope to publish some of those results at a later date.

  • Lots of people focus on data-storage ability and compare KDB with other SQL/query-based databases. Such a comparison is like asking, "Is a Ferrari good for running a bus route?"

    KDB+ is capable of manipulating and querying large data sets. Performance is fast (compared with most RDBMSs) due to its column-based storage, but that's not her strongest suit. In fact, it can become a serious pain if a query is not well written or when the server just doesn't have enough memory. From my experience, lots of KDB applications do not even host realtime/historical databases.

    With her simple messages publish/subscribe mechanism and flexible IPC calls, one can easily build complex event-processing engines (and even networks of engines) to operate on these time-series data. Its performance can easily outmatch other engines written in Java/C thanks to q's vector processing and its many functions tailor-made for time-series data.

    Hi MK Lee, what do you mean by 'her' in 'With her simple messages'?

  • In recent benchmarks of kdb+ vs. other big-data technologies, kdb+ still comes out in the top 5, especially considering hardware costs for this particular data set.

    [1.1 Billion Taxi Rides on kdb+/q & 4 Xeon Phi CPUs]

  • At the risk of reopening an old question, I thought that I would offer my experience.

    I worked for a competitor of Man AHL (who created Arctic). We used a columnar database called HP Vertica. It's not free, unfortunately. We used it as a huge time-series database for many use cases. We had one cluster of 3 fairly powerful machines that gave us redundancy if one failed, and we had tables with over 100bn rows without issues. It is SQL compliant and ACID compliant, and there were tweaks to control table/column compression, distribution, and replication.

    It has great support for datawarehouse operations, like fast ingestion and deferrable constraints.

    It did helpful things like auto-normalisation of low-cardinality columns, i.e. one could control the logical and physical schema quite precisely and easily. We could run a select distinct on a single column over 130bn+ rows in less than 10 ms. We used it for time-series storage (daily and tick) with one observation per row (i.e. one tick/close etc. per row).

    We were able to build some interesting patterns:

    • We could store all incoming data against source identifiers (eg, bloomberg ticker), and then pre-materialize the symbology joins vs our internal identifiers.
    • We could store all versions of a tick and pre-materialize the filter for only the latest values.
    • We wrote a batch job that could copy entire SQL server databases directly into the data warehouse - with automatic handling of nice columns / tables etc. This made dealing with legacy / small / complex datasets very nice as now you could join all enterprise schemas over a single SQL connection. The data scientists used that a LOT.

    We were able to do that over the entire dataset of 20k+ symbols * 20+ years of tick data as a series of daily batch runs. This made it very easy to manage the dataset for operators, as they could write selects and deletes using the standard SQL that we all love.

    The JDBC driver was also pretty good and offered all the usual semantics and datatypes. At one point I also hooked it up to an Apache Spark cluster of 256 cores and managed a parallel write over JDBC to the same destination table of over 1.2m rows per second (including the commit).

    The developer experience was great, as it mainly just worked as a massively powerful RDBMS with a lovely SQL experience. Vertica/HP has invested a great deal of time in providing many useful helper functions (time & date, and analytical functions), and the overall feel was quite similar to PostgreSQL (which is a good thing).

    Overall, a nice way to achieve horizontal scalability with little required of the developer or the DBA. You just need your cheque book.

    Nowadays even this approach is probably out of date, now that we have things like (really) fast SSDs, cheaper and larger RAM (EPYC Rome supports 4 TB of RAM on a single machine!), Spark, and better tooling around NoSQL implementations. We also have great new initiatives like TimescaleDB, which is FOSS. Combine FOSS with the short setup times of modern hardware on public cloud and you could probably iterate to something similar for little upfront time and money.

License under CC-BY-SA with attribution

Content dated before 7/24/2021 11:53 AM