Is R being replaced by Python at quant desks?

  • I know the title sounds a little extreme, but I wonder whether R is being phased out at a lot of quant desks at sell-side banks as well as hedge funds in favor of Python. I get the impression that, with improvements in pandas, NumPy, and other Python packages, Python's ability to meaningfully mine data and model time series is improving drastically. I have also seen quite impressive Python implementations that parallelize code and fan out computations to several servers/machines. I know some packages in R are capable of that too, but I just sense that the current momentum favors Python.

    I need to make a decision regarding the architecture of a subset of my modeling framework myself and need some input on what the current sentiment among other quants is.

    I also have to admit that my initial reservations regarding Python's performance are mostly outdated, because some of the packages make heavy use of C implementations under the hood, and I have seen implementations that clearly outperform even efficiently written, compiled OOP-language code.

    Can you please comment on what you are using? I am not asking for opinions on whether you think one is better or worse for the tasks below, but specifically why you use R or Python and whether you even place them in the same category for accomplishing, among others, the following tasks:

    • acquire, store, maintain, read, clean time series
    • perform basic statistics on time series as well as advanced statistical models such as multivariate regression analyses, ...
    • perform mathematical computations (Fourier transforms, PDE solvers, PCA, ...)
    • visualization of data (static and dynamic)
    • pricing derivatives (application of pricing models such as interest rate models)
    • interconnectivity (with Excel, servers, UI, ...)
    • (Added Jan 2016): Ability to design, implement, and train deep learning networks.

    EDIT: I thought the following link might add more value, though it's slightly dated (2013); for some obscure reason that discussion was also closed: https://softwareengineering.stackexchange.com/questions/181342/r-vs-python-for-data-analysis

    You can also search for several posts on the r-bloggers website that address the computational efficiency of R and Python packages. As was addressed in some of the answers, one aspect is data pruning, i.e. the preparation and setup of input data. Another part of the equation is the computational efficiency when actually performing statistical and mathematical computations.

    Update (Jan 2016)

    I wanted to provide an update to this question now that AI/deep-learning networks are being very actively pursued at banks and hedge funds. I have spent a good amount of time delving into deep learning, running experiments, and working with libraries such as Theano, Torch, and Caffe. What stood out from my own work and from conversations with others was that a lot of those libraries are used via Python and that most researchers in this space do not use R in this particular field. This still constitutes a small part of the quant work performed in financial services, but I wanted to point it out because it directly touches on the question I asked. I added this aspect of quant research to reflect current trends.

    I am not sure, but there are definitely some advantages for Python with regard to the development of packages in some areas.

    You are a highly respected member of this community, but I am getting a worse and worse feeling about this question. One of the examples of questions that we don't want on this site is "What programming language should I use?" (quant.stackexchange.com/help/on-topic). When you look at the discussions in the comments you can see why: they are getting more and more contentious, and you seem to have made up your mind anyway. I think if somebody with less rep had asked this question it would have been closed right away. I think it would be best to close this question. Do you see my point?

    @vonjd, I have not made up my mind; else I would not have asked. And we should be fair in acknowledging that some on this site have a very strong vested interest in leaning towards R because they derive a portion or all of their living from writing R code, hence their rather strong wording. I defend the question because the question, and hopefully the answers, are imho very relevant to those working at quant desks, or potentially to those who pour many tens if not hundreds of thousands into such projects.

    But I am of course entirely open to letting the community vote to have the question closed if most think it is neither relevant nor targeted enough (though I listed very specific use cases that I am interested in)...

    By the way, is there a way to vote or suggest allowing certain questions that may currently not fit the desired format? I find questions like "which language is recommended for xyz" or "is abc-regression better suited to tackle xyz than bcd-regression" very important and useful for those who work in this field. At least a lot more useful than many questions that are kept open of the type "where can I download free tick data" or "does yahoo finance backward adjust dividend splits"...

    Fair enough. You could raise this on meta when you think that the rules of this site should be changed.

    Upvoted on meta.

    I noticed that there has been a relative flurry of down/up votes on answers to this particular question. While I think there is value in a referendum on the subject, I would encourage more people to share their thoughts in the comments and in new answers, especially those with experience using both languages.

    I did not notice a flurry, nor downvotes. And I fully agree with your suggestion. What currently discourages me from participating more actively on this site again is the pressure to conform to strict "rules" and guidelines. Humans are not bits and bytes, nor does efficient and intelligent learning fit into black-and-white Q&A formats. As this question demonstrates, the format itself is already being questioned because some seem to feel incredibly uncomfortable going outside their "rules-based" comfort zone. I would also like to see more healthy debate and sharing...

    Many people put a lot of effort into this, so I would be interested in whether the answers helped you arrive at a conclusion?

    @vonjd, no, I have not yet made a decision. But I am much better informed thanks to some of the answers and to spending more time with packages such as data.table and Rcpp. It does not change my impression of bits and pieces being "glued together" in R in order to run more performant computations (Rcpp is in effect a bridge to run compiled C++ code, and data.table is a highly indexed data structure that should not be compared with solutions that make no use of indexing). My main concern at this point is that I will end up with code bases in multiple languages to achieve ...

    ...performance that matches or exceeds what can be done purely in Python. For example, any statistical or numerical techniques that cannot be vectorized essentially require me to maintain a C++ code base to beat equivalent operations in Python. Similar applies to visualizations: most dynamic visualizations, or visuals that allow me to pan/zoom or otherwise manipulate rendering at run-time, require knowledge of .js and/or D3.js. Python, on the other hand, allows me to more easily interface with existing visualization libraries I already use. But as said, I have not yet come to a final conclusion.

    Thanks, vonjd, I took a quick look but am frankly not a big fan of generalized comparison reviews because they do not address specific needs (for obvious reasons).

    It's not far enough along in the development cycle for your needs, but keep an eye out for Julia in the future. I've played around with it a bit myself and it has the potential to replace/complement both R and Python for this kind of technical work.

    @MattWolf Perhaps your Jan 2016 update would be better as a separate question. E.g. "What libraries/packages would you recommend to do deep learning in quant finance applications?" (That leaves it language-neutral, which may or may not be a good idea...)

    @DarrenCook, while I agree that this site should encourage much more exposure to deep learning in quant finance, I believe the addition (Jan 2016 update) is very relevant to this question. Deep learning is perhaps the area at banks, hedge funds, and private-equity firms that sees the most **incremental** investment in terms of funding and talent hiring. I do think it is an area that clearly favors Python over R, and I would love to hear other practitioners' take on it.

    @MattWolf OK. I'm just saying it is better to start a new question than update one that already has answers, including an accepted answer.

  • statquant (correct answer, 6 years ago)

    My deal is HFT, so what I care about is:

    1. read/load data from a file or DB into memory quickly
    2. perform very efficient data-munging operations (group, transform)
    3. visualize the data easily

    I think it is pretty clear that 3. goes to R: base graphics, ggplot2, and others allow you to plot anything from scratch with little effort (a quick sketch below).
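    As an illustration (mine, not part of the original answer), a minimal ggplot2 sketch on simulated data; the data frame and column names are made up:

    library(ggplot2)

    # simulated daily series, purely for illustration
    df <- data.frame(day = seq(as.Date("2015-01-01"), by = "day", length.out = 500),
                     px  = cumsum(rnorm(500)))

    ggplot(df, aes(day, px)) +
      geom_line() +
      labs(x = NULL, y = "price")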

    About 1. and 2., I am amazed, reading previous posts, to see that people are advocating for Python based on pandas and that no one cites data.table. data.table is a fantastic package that allows blazing-fast grouping/transforming of tables with tens of millions of rows. From this benchmark you can see that data.table is multiple times faster than pandas and much more stable (pandas tends to crash on massive tables).

    Example

    R) library(data.table)
    R) DT = data.table(x=rnorm(2e7),y=rnorm(2e7),z=sample(letters,2e7,replace=T))
    R) tables()
         NAME       NROW NCOL  MB COLS  KEY
    [1,] DT   20,000,000    3 458 x,y,z    
    Total: 458MB
    R) system.time(DT[,.(sum(x),mean(y)),.(z)])
       user  system elapsed 
      0.226   0.037   0.264 
    
    R) setkey(DT,z)
    R) system.time(DT[,.(sum(x),mean(y)),.(z)])
      user  system elapsed 
      0.118   0.022   0.140 
    

    Then there is speed: as I work in HFT, neither R nor Python can be used in production. But the Rcpp package allows you to write efficient C++ code and integrate it into R trivially (literally by adding two lines). I doubt R is fading, given the number of new packages created every day and the momentum the language has...
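    To make that concrete, here is a minimal sketch of the kind of workflow meant above (my own illustration; the function name sumSquares and the toy computation are made up): cppFunction() compiles and binds a C++ function into the R session in a single call.

    library(Rcpp)

    cppFunction('
    double sumSquares(NumericVector x) {
      double total = 0.0;
      for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
      return total;
    }')

    x <- rnorm(1e7)
    sumSquares(x)   # runs as compiled C++ but is called like any R function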

    EDIT 2018-07

    A few years later, I am amazed by how the R ecosystem has evolved. For in-memory computation you get unmatched tools, from fst for blazing-fast binary reads/writes to fork- or cluster-based parallelism in one-liners. C++ integration is incredibly easy with Rcpp. You get interactive graphics with classics like plotly and crazy features like ggplotly (which simply makes your ggplot2 plot interactive). Having tried Python with pandas, I honestly do not understand how there could even be a match: the syntax is clunky and the performance is poor, but I must be too used to R, I guess. Another thing that is really missing in Python is literate programming; nothing comes close to rmarkdown (the best I could find in Python was Jupyter, but that does not even come close). With all the fuss surrounding the R vs Python language war, I realize that the vast majority of people are simply uninformed: they do not know what data.table is, that it has nothing to do with a data.frame, or that R fully supports TensorFlow and Keras... To conclude, I think both tools can do everything, and it seems that the Python language simply has very good PR...
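    A small sketch of that 2018-era workflow as I read it (my own illustration; the file name ticks.fst, the columns, and the placeholder fit_one()/chunks names are made up):

    library(data.table)
    library(fst)        # fast binary serialization for data frames
    library(ggplot2)
    library(plotly)     # provides ggplotly()

    DT <- data.table(ts = seq(Sys.time(), by = "sec", length.out = 1e6),
                     px = cumsum(rnorm(1e6)))

    write_fst(DT, "ticks.fst")                         # near-instant binary write
    DT2 <- read_fst("ticks.fst", as.data.table = TRUE) # and read-back

    # fork parallelism in a one-liner (Unix-alikes); chunks and fit_one() are placeholders:
    # results <- parallel::mclapply(chunks, fit_one)

    p <- ggplot(DT2[seq(1L, .N, by = 100L)], aes(ts, px)) + geom_line()
    ggplotly(p)         # the same ggplot2 figure, now zoomable/pannable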

    Hmm, I guess I need to disagree with you here regarding visualizations. R packages are still light-years behind in efficient and especially dynamic visualization. Every first-year IT student can chart a time series from scratch. What people want and need is visualization of millions of data points that a charting app can downsample, with fast zooming and panning and handling of annotations. I have not seen anything in R that comes even remotely close.

    Secondly, data tables in R are very, very slow. Throw a few million time-series data points at them and data frames drop to their knees. The only fast thing I have seen was an implementation that used memory mapping. But one could argue that is just an interface R uses... as soon as you actually grab the data and run R functions over it, things become very slow. Caveat here: I have not looked at any new developments over the past 8 months in the R space. If there is anything new, I would be happy to be pointed to it.

    And when you talk about something crashing, the problem usually lies with improperly provided input formats. The same can happen in OOP languages, Python, and R.

    Every now and then there will be stuff that you will not be able to do with `data.table`, which will cause you to fall back to a regular R `data.frame` and curse yourself for not having started off with `pandas` instead. For instance, `data.table` can fail to read many CSV files that both a regular R `data.frame` and pandas can read easily. Or try working with data that has 600 columns and writing loops over the columns: in a `data.frame` or `pandas` you can loop with `for i in x.columns:` and do something like `x.loc[:,i] = ..`, but in `data.table` you might need 600 lines, one per column.

    @uday: about loops... that's also false; look at `.SDcols`, which is faster and easier than a loop (see the sketch below).
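    For readers unfamiliar with the idiom, a minimal sketch of what is meant (my own illustration; the 600-column toy table is made up):

    library(data.table)

    # an id column plus 600 numeric columns, as in the example above
    DT <- data.table(id = 1:10, matrix(rnorm(10 * 600), nrow = 10))
    num_cols <- setdiff(names(DT), "id")

    # one statement instead of a 600-iteration loop: standardize every numeric column
    DT[, (num_cols) := lapply(.SD, function(x) (x - mean(x)) / sd(x)), .SDcols = num_cols]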

    @statquant, nope, it is not: it frequently breaks down for CSVs with mixed text and numbers, and you will get tired of writing on SO, where R experts will remind you to give reproducible test cases even if you mention that you are reading a 2 GB CSV file, etc.

    @statquant, that is a pretty bold claim you make. I am happy to whip up a few test batteries when I find time in the next couple of days, but being a kdb user myself I find that pretty hard to believe. I will report back with some numbers. Thanks for your answer and for sharing your insight. My intent to move away from kdb, by the way, is precisely what caused me to ask this question.

    @MattWolf What claim? Do you refer to this benchmark: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

    @statquant, another issue with `data.table` is that its founders and followers are somewhat too protective (although it is similar in the case of `pandas`), and at any sign of mentioning issues with `data.table` you can expect your post on SO to get rapidly downvoted. At least in the case of `pandas`, (1) the source code is available on GitHub (you can do an easy search on the web without downloading) versus downloading the source code from CRAN, and (2) you can easily override `pandas` to customize your own subclass of `pandas` dataframes (I use two such specialized subclasses for my work).

    The claim that grouping and transforming 20 million rows takes less than 1 second, as well as your stating that the speed reaches kdb performance benchmarks...

    @MattWolf OK, I can help with this if you fancy it; you can ping me at username At outlook D.T cOm. But your initial question was R vs Python... not kdb :). Actually, I just tried on my personal laptop to sum a column of 2e7 random Gaussian numbers and average another (independent) one by group (23 of them); it took <200 ms (the table was 500 MB in memory).

    Statquant, I appreciate your offer and will contact you. I only mentioned kdb because you brought it up. I am definitely interested in gaining more insight into the data.table package you mentioned, because performance in R has been a deal breaker for me so far.

    @statquant, :-) given how persistent you are about `data.table`, I will try it for purely numeric data and test whether it is faster than `pandas`. Thanks for sharing the link. That `.SDcols` option might not have been there two years ago when I was trying to use `data.table` extensively (or maybe I overlooked it).

    @statquant, I played a bit with data.table, and while it does indeed seem to significantly improve grouping and table transformations, my original concern is not addressed. For computational efficiency, the organization of input data is only one part of the equation. The main resource consumption comes from the actual statistical and mathematical computations, and that is where I am not (yet) sold that R comes close to Python's stats and math libraries in terms of computational efficiency.

    @Matt Wolf I'll be honest: I am not sure you really have a clear idea of what you want. What do you do that requires a lot of computational power? If that's linear algebra, then you have RcppArmadillo or other packages that I know blow NumPy away... Can you be clearer please?

    I thought I was very specific about my requirements in the question I originally asked. Just because some of the requirements involve large data quantities and others do not should not be confused with my not knowing what I want. I am looking for an architectural change to our development and analytical-testing setup that needs to cover the analysis and visualization of vast amounts of time-series data and options order-book data on one end of the spectrum,

    as well as pricing derivatives via Monte Carlo, PCA, or more mathematically involved PDE solvers on the other end. I get the point that indexed data tables allow for fast access to chunks of data, but that only serves as the starting point of any analytical or numerical exercise...

    What I need to better understand is the computational efficiency of the actual statistical and numerical procedures. Your data tables can be as fast as they like, but if the actual visualization of time series in R drops to its knees when you throw a million or so data points at it, then you have your bottleneck right there. The same goes for MC pricing. Is that clearer?

    OK... I think those three things are numerically extremely different: the first should take advantage of massive parallelisation, the second of very efficient linear algebra, and the third of GPU computing. Each would require a post of its own. I doubt R will allow you to obtain cutting-edge implementations in all those fields, and neither would Python... As far as data visualisation is concerned, have you looked at http://www.amazon.co.uk/Graphics-Large-Datasets-Visualizing-Statistics/dp/0387329064 ? I think we should continue this discussion in a chat/email as this gets a bit off topic.

    I fully agree that each requires a different approach and poses different requirements in general. However, at the end of the day my team and I still need to get our work done within our framework of choice. For visualization, for example, we use a C# front end that we have equipped with massive parallelization capabilities, customizability, and the ability to make use of hardware-based technologies. For parallel and async computing we also interface with different technologies, which is precisely why we are looking for a framework that boasts strong capabilities for interfacing with other components.

    And hence we are looking for ways to migrate part of our design and pricing framework to either R or Python. I have already benefited tremendously from this discussion, and you make lots of very good and, above all, informed points. Thanks a lot for adding so much value.

    OK, one last thing then: I did not realize that you were willing to spend substantial time on heavy development. If that's the case, even though I cannot help you directly as I have not done it myself, I would still go the R route. All bleeding-edge numerical procedures provide a C++ API, and Rcpp provides the easiest way to leverage it (a sketch of that route is below). Look at the R Task Views for plenty of references. As far as graphics are concerned, D3 is also coming...
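    As an illustration of that route (mine, not statquant's; it assumes RcppArmadillo is installed, and the function name olsBeta plus the toy regression are made up), a compiled linear-algebra kernel callable directly from R:

    library(Rcpp)

    cppFunction(depends = "RcppArmadillo", '
    arma::vec olsBeta(const arma::mat& X, const arma::vec& y) {
      // ordinary least squares: solve (X^T X) b = X^T y via LAPACK
      return arma::solve(X.t() * X, X.t() * y);
    }')

    X    <- cbind(1, matrix(rnorm(1e5 * 9), ncol = 9))
    beta <- runif(10)
    y    <- as.numeric(X %*% beta + rnorm(1e5))
    olsBeta(X, y)   # should approximately recover beta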

    I am spending some time with Rcpp this weekend. Thanks for the pointer.

    About plots: R is really good at plotting 2D plots once and not touching them. It's meh at 3D plots and has virtually no support for mouse interaction. Everything else, I think, is spot on.

    @MattWolf I'd like to know your thoughts now that you had the time to research and compare.

License under CC-BY-SA with attribution

