"SHA-1 is an algorithm and what it does is: it takes some data as input and generates a unique 40 character string from it." (source)
That is so horribly incorrect that I had to make a note - a hash function does not, in general, return unique results.
Wikipedia says "A hash function is any function that can be used to map data of arbitrary size to data of fixed size."
Under this definition, the function has no chance of producing unique results.
Just as you would fail to assign a unique 8-bit ID (2^8 = 256 states) to everyone in a sizeable country, you cannot assign a unique 160-bit value to every possible input, given that inputs are typically well over 160 bytes (let alone 160 bits) long.
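The counting argument can be tried out at a small scale. The sketch below is only an illustration, not anything Git does: `tiny_hash` and `find_collision` are hypothetical helpers that truncate SHA-1 to 8 bits, so that hashing 257 distinct inputs must produce at least one repeated value.

```python
import hashlib


def tiny_hash(data: bytes, bits: int = 8) -> int:
    """Return the top `bits` bits of SHA-1(data) as an integer (illustration only)."""
    digest = hashlib.sha1(data).digest()  # 20 bytes = 160 bits
    return int.from_bytes(digest, "big") >> (160 - bits)


def find_collision(bits: int = 8):
    """Hash 2**bits + 1 distinct inputs; by pigeonhole, two must share a value."""
    seen = {}
    for i in range(2 ** bits + 1):
        data = str(i).encode()
        value = tiny_hash(data, bits)
        if value in seen:
            # Two different inputs map to the same truncated hash.
            return seen[value], data, value
        seen[value] = data
    raise AssertionError("unreachable: more inputs than possible hash values")


if __name__ == "__main__":
    first, second, value = find_collision(8)
    print(first, second, value)
```

The same reasoning applies to the full 160 bits; the number of inputs needed to force a collision is just far too large to enumerate in practice.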
Obviously, if one could assign a unique 160-bit identifier to data of any size, that would amount to universal compression down to 160 bits (40 hexadecimal characters). Decompression might take a huge number of steps, but it would work: an algorithm could simply step through all valid inputs in order of length (1-bit candidates, then 2-bit ones, and so on), calculate the hash of each, and stop when it matches the desired result.
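A minimal sketch of that hypothetical "decompressor" follows. It is written under two assumptions of mine: candidates are enumerated byte by byte rather than bit by bit, and the search stops at a small `max_len` so that it finishes. The function name `brute_force_preimage` is made up for this example.

```python
import hashlib
from itertools import product


def brute_force_preimage(target_hex: str, max_len: int):
    """Return the first byte string (shortest first) whose SHA-1 equals target_hex.

    Returns None if no candidate of at most max_len bytes matches.
    """
    for length in range(max_len + 1):
        # Enumerate every byte string of this length in lexicographic order.
        for candidate in product(range(256), repeat=length):
            data = bytes(candidate)
            if hashlib.sha1(data).hexdigest() == target_hex:
                return data
    return None


if __name__ == "__main__":
    digest = hashlib.sha1(b"hi").hexdigest()
    print(brute_force_preimage(digest, max_len=2))
```

Note that the search returns the first input that matches, which is the original only if no earlier candidate happens to share its hash. That caveat is exactly why the scheme cannot work as general compression.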
Put another way, uniqueness would require a one-to-one mapping (an injection) from an infinite set of inputs into a set of only 2^160 values, which cannot exist.
Anyway, the conclusion is that even Git doesn't have the power to make SHA-1 do the job. OMG :)
Saturday, 14 January 2017
Monday, 2 January 2017
Which Programming Language for Analytics?
I am in the process of turning towards Python for analytics.
It's not a quick shift; it has been going on for a while. I'm busy with my day job as a developer, studying here and there, and have historically been more involved with R. On top of that, I should see people now and then, to avoid getting shot by a Walking Dead fan.
Someone recently told me that my preference for Python is not generally justified when it comes to analytics. To my surprise, the alternative he mentioned was Java. Needless to say, I was looking at things from the perspective of an analyst who applies various methods to extract insight from data.
So... is Java Taking Over?
I wanted to double check - but I had to realize it's non-trivial to find (or I was unlucky in finding) a relevant and current ranking or, more interestingly, a useful visualisation of how languages perform in the analytics segment.
The other thing I realised was that it wasn't that difficult to get a grip on the problem, even if it's a less reliable grip than one stemming from a more complex methodology.
Digging on GitHub
A small script pulled a lot of data from GitHub. The results were then cross-checked with StackOverflow data.
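For readers who want to try something similar, here is a rough sketch of how per-language repository counts for a keyword could be collected. It is not the script used for this post: the helper names (`build_search_url`, `repo_count`, `shares`) are made up, and it assumes GitHub's public repository search endpoint with its `language:` qualifier and the `total_count` field of the response. Unauthenticated search requests are heavily rate-limited, so a real run would need pauses or an access token.

```python
import json
import urllib.parse
import urllib.request

GITHUB_SEARCH = "https://api.github.com/search/repositories"


def build_search_url(keyword: str, language: str) -> str:
    """Build a repository search URL for a keyword restricted to one language."""
    query = f"{keyword} language:{language}"
    # per_page=1 because only the total count is of interest here.
    return GITHUB_SEARCH + "?" + urllib.parse.urlencode({"q": query, "per_page": 1})


def repo_count(keyword: str, language: str) -> int:
    """Ask GitHub how many repositories match (network call, rate-limited)."""
    with urllib.request.urlopen(build_search_url(keyword, language)) as response:
        return json.load(response)["total_count"]


def shares(counts: dict) -> dict:
    """Turn raw counts per language into fractions of the total."""
    total = sum(counts.values())
    return {lang: (n / total if total else 0.0) for lang, n in counts.items()}


if __name__ == "__main__":
    languages = ["python", "r", "java"]
    counts = {lang: repo_count("machine learning", lang) for lang in languages}
    print(shares(counts))
```

Repeating such a query per keyword and per language gives a crude share of each language within a segment, which is roughly the kind of figure the charts are built on.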
The (privately drawn) conclusion suggested by the charts is that Python is currently the best all-purpose choice for analytics. Java is coming up fast, but it has a brutal market share anyway, so it is possibly just getting soaked in analytics like everything else, without really driving the change from an end-user's (analyst's) perspective. R appears to be falling from grace slowly but steadily, with handicaps in the Machine Learning area.
The most recent version of the full document/source code is available here, or by clicking the miniature. At the time of reading, I'm probably still in the process of making improvements to these.
So the conclusion may well change - but odds are it's accurate. These findings should primarily be a predictor of the future share of languages among today's aspiring analysts, and representative of the present to a smaller extent.
The Winner - in My Opinion
A summary of my considerations about Python's performance is presented below.
Aspect | Quick Assessment |
Analysis | Prime Language, Increasing Share |
Spark | Significant, Reducing Share |
Machine Learning | Prime Language |
Deep Learning | Prime Language |
Big Data | Strong Choice |
AWS | Strong Choice, Increasing Share |
Data Science | Prime Language / Strong Choice, Increasing Share |
Mining | Strong Choice |
Visualization | Prime Language / Strong Choice, Increasing Share |
Chart | Good Enough Choice from a Diverse Competition |
Kaggle | Prime Language, Increasing Share (possibly due to R's demise <sniff/>) |
For those who have attempted to follow the events in the analytics area, these segments (or the segments these keywords proxy towards) may be familiar and of high importance.
For others, machine learning and deep learning are expected to be the key players for processing data produced and archived at big data levels, and to be able to encompass models for such complex systems as human thinking.
Different Views
As a final thought, the situation probably changes abruptly once it is about architects choosing an implementation language for software they plan to provide for analysts. However, the versatility of Python can possibly counterbalance the efficiency trade-offs it brings into development.
Also mind that GitHub, the most used source code hosting option at the time of writing, is the vehicle for many courses on Coursera, which is (or was then) the most popular MOOC platform and, as such, heavily affects even something as massive as GitHub. Gathering data about this is in progress, but it is already getting clear that these courses are close to the heart of R's popularity on GitHub.
For those interested, I also had a quick peek at server-side languages, with the results summarized in this (continuously updated) document.