Monday, 2 January 2017

Which Programming Language for Analytics?

I am in the process of turning towards Python for analytics.

It's not a quick shift; it has been going on for a while. I'm busy with my day job as a developer, study here and there, and have historically been more involved with R. On top of that, I should see people every now and then, if only to avoid getting shot by a Walking Dead fan.

Someone recently told me that my preference for Python is not generally justified when it comes to analytics. To my surprise, the alternative he mentioned was Java. Needless to say, I was looking at things from the perspective of an analyst, someone who applies various methods to extract insight from data.

So... is Java Taking Over?


I wanted to double-check, but I had to realize it's non-trivial to find (or I was unlucky at finding) a relevant and current ranking or, more interestingly, a useful visualisation of how languages perform in the analytics segment.

The other thing I realised was that it wasn't that difficult to get a grip on the problem, even if it's a less reliable grip than one stemming from a more complex methodology.

Digging on GitHub


A small script pulled a lot of data from GitHub. The results were then cross-checked against StackOverflow data.
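The script itself is not reproduced in this post. As a rough illustration only, here is a minimal sketch of how repository counts per language could be gathered for one keyword, assuming GitHub's public repository search endpoint and its `total_count` field. The function names (`search_url`, `fetch_count`, `language_shares`) are made up for this example and are not taken from the linked source code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# GitHub's public repository search endpoint.
GITHUB_SEARCH = "https://api.github.com/search/repositories"


def search_url(keyword, language):
    """Build a search URL for one keyword/language pair (hypothetical helper)."""
    query = f"{keyword} language:{language}"
    # per_page=1 keeps the response small; only the total count is needed.
    return f"{GITHUB_SEARCH}?{urlencode({'q': query, 'per_page': 1})}"


def fetch_count(keyword, language):
    """Return how many repositories match. Needs network access and is
    subject to GitHub's rate limits for unauthenticated requests."""
    with urlopen(search_url(keyword, language)) as response:
        return json.load(response)["total_count"]


def language_shares(counts):
    """Turn {language: repository_count} into {language: share_of_total}."""
    total = sum(counts.values())
    if total == 0:
        return {language: 0.0 for language in counts}
    return {language: n / total for language, n in counts.items()}
```

Calling `fetch_count` for each language of interest and feeding the resulting dictionary to `language_shares` would give a per-keyword share that can be compared across languages. A similar count from StackOverflow would serve as the cross-check.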

The (privately drawn) conclusion suggested by the charts is that Python is currently the best all-purpose choice for analytics. Java is coming up fast, but it has a huge market share anyway, so it is possibly just getting soaked in analytics like everything else, without really driving the change from an end user's (analyst's) perspective. R appears to be falling out of favour slowly but steadily, with handicaps in the Machine Learning area.

The most recent version of the full document/source code is available here, or by clicking the miniature. At the time of reading, I'm probably still in the process of making improvements to these.
So the conclusion may well change - but odds are it's accurate. These findings should primarily be a predictor of the future share of languages among today's aspiring analysts, and representative of the present to a smaller extent.


Picking a Language for Analytics and Machine Learning


The Winner - in My Opinion


A summary of my considerations about Python's performance is presented below.

Aspect | Quick Assessment
Analysis | Prime Language, Increasing Share
Spark | Significant, Reducing Share
Machine Learning | Prime Language
Deep Learning | Prime Language
Big Data | Strong Choice
AWS | Strong Choice, Increasing Share
Data Science | Prime Language / Strong Choice, Increasing Share
Mining | Strong Choice
Visualization | Prime Language / Strong Choice, Increasing Share
Chart | Good Enough Choice from a Diverse Competition
Kaggle | Prime Language, Increasing Share (possibly due to R's demise <sniff/>)
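
How the "Increasing Share" and "Reducing Share" labels were assigned is not spelled out here. One simple way to arrive at such labels, shown purely as an assumption, is to compare a language's share in the earliest and latest year available. The function name `share_trend`, the "Stable Share" label and the `tolerance` threshold are all hypothetical.

```python
def share_trend(shares_by_year, tolerance=0.01):
    """Label a trend from {year: share} by comparing first and last year.

    `tolerance` is an arbitrary threshold (one percentage point by default)
    below which the change is treated as noise.
    """
    years = sorted(shares_by_year)
    change = shares_by_year[years[-1]] - shares_by_year[years[0]]
    if change > tolerance:
        return "Increasing Share"
    if change < -tolerance:
        return "Reducing Share"
    return "Stable Share"
```

A first-versus-last comparison ignores what happens in between, so a fitted slope over all years would be a sturdier choice if the yearly numbers are noisy.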

For those who have attempted to follow events in the analytics area, these segments (or the segments these keywords stand as proxies for) may be familiar and of high importance.

For others, machine learning and deep learning are expected to be the key players for processing data produced and archived at big data levels, and to be able to encompass models for such complex systems as human thinking.

Different Views


As a final thought, the situation probably changes abruptly once it is about architects choosing an implementation language for software they plan to provide for analysts. However, the versatility of Python can possibly counterbalance the efficiency trade-offs it brings into the development.

Also mind that GitHub, the most used source code hosting option at the time of writing, is the vehicle for many courses on Coursera, which is/was in turn the most popular MOOC provider. As such, Coursera heavily affects even a massive platform like GitHub. Gathering data about this is in progress, but it is already getting clear that these courses are close to the heart of R's popularity on GitHub.

For those interested, I also had a quick peek at server-side languages, with the results summarized in this (continuously updated) document.
