Tuesday, 28 June 2016

R: a quick permutation sightseeing

Listing permutations of a number of distinct elements is surely doable, as long as the number of elements is not too big. But with only a little pickiness, such an innocent task can raise many interesting (but small) challenges to deal with once everyday expectations (memory, heap, CPU efficiency) are respected.

I found myself facing this task which involved a list of permutations of 5 elements (related to the Draper Satellite Image Chronology competition on Kaggle), and while it's truly simple, I thought I would just have a peek at how far the common sense has progressed with it.

Answers to a StackOverflow question should tell something about it.

There were
  1. the brute force exhaustive search approach, probably the simplest of all (which is probably the most suited for my original motivation)
  2. the classic recursive implementation
  3. Museful's much more vectorized ("matrixized") solution, a variant of which had too crossed my mind (unfortunately the SO one is just as good if not better than mine would have been, so I think I shouldn't waste many words on it)
and some others which seemed conceptually equivalent to either of the above.

The problem was pretty much solved, but my brain kept ticking... so there is one more I would like to take a note of here.

Firstly, I did deal with this before, a very long time ago, and I had a solution which I annoyingly forgot about, but it was easy to recreate something similar.
That key concept was that (over the same elements) a permutation is only different from another one by (a number of) exchanges between pairs of elements.

Actually, that's almost too easy to see, as just by always exchanging a misplaced element with the one with the right number, and repeating it, in no more than n exchanges (actually n-1 is sharper, and probably the average number of minimum required steps is much lower) we should get there.

E.g. have 12345 want 54321

exchange 1 & 5, get:

52341

exchange 2 & 4, get:

54321

done.

In other words, just by exchanging elements, we can go from one permutation to the other.

So this is probably an option to list all the permutations just as well.

Going from 12345 to 12354 is a simple move: replace the last two. To get to the next meaningful step, a third element surely needs to be involved. Then we may get back to varying the first two. Then the elements in the most frequently varying pair need some fresh blood again.

To see what I mean, the pair variations over a given 'symbol set' may be:
  1. 12345 and 12354
  2. 12435 and 12453
  3. 12534 and 12543
Giving the first 6 permutations. The generalisable recipe of items for the above starters:
  1. don't do anything,
  2. replace the 3rd with one element from the pair,
  3. replace the 3rd with the other element from the pair.
The flow goes as:

This may still feel a bit unclear, but technically from the permutations of (n - 1) elements we get those of n elements by placing the new element into different positions within all n slots, which means in the first case it's not involved in the last (n - 1), then it substitutes a different one of those (n - 1) in each "big" step. The last (n - 1) elements are just shuffled around as usual.

Slightly modifying it (reverse order) for the ease of coding, we get the below implementation.

I was really happy to sketch this out, although it's not much of an improvement. If I really want to be nice to myself then it is, as it does not have to be recursive, surely has a light heap footprint (the momentarily state of the algo is stored in 2 * n integers), but otherwise isn't too interesting, at least in R, where vectorization over tons of short loops is much preferred. It could possibly be a good choice in C(++) if someone's bored enough.

Then I Googled permutations with pair switches and got to a really interesting PDF.
Their solution assigns a 'momentum' (direction) to each number at each step, and I don't fully understand it (yet). Powers beyond my control :)
But surely is a simple one and is more efficient in that this one doesn't put back the items from where they came from, i.e. only applies half of the switches. However, it does have to find the largest movable element.

So do I win or what? :) Maybe only as long as I resist the temptation to read beyond the first page of their stuff... I'd say, for the temporary illusion, not today! Sure they have something better.. :)

Monday, 13 June 2016

R: the "is ordered" exercise

Some programming treats look nicer to me, one of them is determining whether a series of numbers is ordered in one way or another.

It can come useful when trying to decipher some shuffled data set, which mainly strikes me on Kaggle. Albeit with a little flaw it could be done 'on the spot', i.e. trivial solutions are easy to implement, it seems like an interesting exercise if vectorization is a requirement.

And I'm trying to collect such things here and keep my brain moving, so I created a solution which I'll give below.

The function

Not much (just at a proof of concept level), but tested: link to the test code. Obviously if anyone has a better idea for the implementation, please give me a shout!

Sunday, 15 May 2016

Animations in R (part 1)

A sometimes forgotten feature in R is animations. This can add one more level to the visualizations, although it can't compete with proper dynamic pages. Whenever generating a phase on the go would be too demanding, however, as is frequently the case with some prototype or one-off, less performance optimized algorithm, it can still be the way to go.

#1

Just to remind myself that it's possible, here's a sinus plotter clone in an animGIF:

Obviously when created for the web it needs to be periodic within a few hundred frames, in the absence of the implied constraints it can be more flexible. By the way it's originally a microprogramming classic (to me, at least), e.g. this one was an example. Beyond that, its history spans way back, see the Lissajous-curves, for instance.

 

 

#2

Alas, this one's a video and I have little control over the positioning, but here we can see how the nature of an xgboost forest based prediction responds to more iterations (= new trees) for a binary classification problem with a noisy training set, and how much of the problematic overfitting can be prevented by reducing the maximum tree depth. Of course, it's only overfitting as long as those stand-alone points, constituting a few percent of the whole training data, are really noise, and not valuable information! Apparently, the max. depth is similar to some sort of a filtering parameter. I showcase the result of the training re-evaluated over the training set for depths of 1, 2, 3, and 4 below.

Obviously someone with a slightly mathematical mindset could then ask - why don't we allow for fractional depth values? Since that's the parameterization style the most common image processing filters allow for. Sharpness, blur amount etc. - it's too common to miss a limitation.
My thought there is that max.depth accomplishes more than one thing, namely at least 1. poses an expectation towards what can be effectively considered a significant cluster of true values 2. controls the complexity of rules (like if (v_i_1 and v_i_2 and .. v_i_max.depth) return(1)). This can both be good and bad, but a) some separation is worth a thought, e.g. filtering as a preprocessing step and b) in practice the preferred values may change by location, so it can be a very long story...

... and that's the intro for now, I have some plans about returning to this one over the future:





[To be continued ...]

Sources:
#1
#2

Wednesday, 11 May 2016

Yesterday's R dojo

Have been to an R code dojo meetup yesterday. These meetups are to let people try out & practice things in smaller groups, typically of 4-6. I tend to value any practical one over the classic presentation + Q&A ones as it's been not once communicated to me that frontal education has had shown the poorest performance, so despite it's price: avoid it like the plague. Prefer events where practice is involved!

QuickCheck - another testing method/philosophy


First of all, I wanted to delve into QuickCheck I heard about years ago. To me this one is more on the integration testing side, still a decent one (actually - what isn't a bit of integration testing?)

I believe the stochastic behaviour of many ML routines almost mandates having repeated tests - a single test case being passed is not convincing on its own, just as no one really trusts an algorithm that forecasts well once (but perhaps never anymore). And this is just one side of the coin - the output normally cannot be exactly known, but approximate values, with error bands, are typically acceptable (actually, I expect this could get as far as putting proper statistical testing in place). Combine that with an output of several values; 'narrow' rules of self-consistency instead of in-depth byte-to-byte specification of the output becomes an appealing option.

On the inbound train I got to installing it

and finding the steps at https://github.com/RevolutionAnalytics/quickcheck, then doing a

gave 2 more examples.
The examples are like (this one is from the quickcheck package vignette):

Unary function, unary output. What could have been a little more interesting to me, was defining a bivariate function - since it's operating on a much larger input space (+1 dimension), where random sampling of the space becomes one reasonable approach to somewhat evenly testing on the volumetrically exploding set of input states.

I had exactly 0 luck there:

Some - at first, incomprehensible, recycling occurs and I'm flooded with warnings like:

9: In y + x :
  longer object length is not a multiple of shorter object length

Sure it could be implemented as two embedded test() calls, but I assume there's a nicer, terser way.

Anyhow, I'll have to postpone stuff again ... somehow (no surprise there) I was the only person interested in this thing :)

PowerBI


So I found myself in a group of lads not so much interested in QuickCheck but Microsoft's new BI platform, thanks to Marczin for bringing up this.
The "big thing" here  was that it can operate with R, so that streams of data processing get combined with R's powers at statistical analysis and data visualization.

Disappointing, but this is for Windows users. At least the desktop version. As I'm not much of a business user, I was sort of sent away when I wanted to run things online... all right, the more of me remains for everything else then :) (everything else applauding)

We did manage to put things together. After dragging the fields of the automagically linked tables to the inserted R object's "values" list, and writing a tiny script referencing values in dataset$field style, R charts appeared. Actually, it wasn't at all as intuitive as was MS Access (2000/XP) ages ago, I mean, putting the data flow together took some time to work out, also I had some redundancy in one of my CSV's so we wasted some time on that, but it finally worked.

Microsoft R Open


What is more relevant is that it relies on Microsoft R Open, which in turn does work on various platforms. That I'll surely have to give a closer look.
I somewhat tend to forget what I saw: xgboost, for instance, comes up in my mind as something written in a low-level way, e.g. in C or C++ at the heart. However, if it isn't much so (for instance, matrix operations are still left to be carried out by R's native solutions), then R Open's capabilities could speed things up. Appears I'll have to see that for myself ...

Friday, 8 April 2016

Working on a longer coursera project locally on Lubuntu

Today I decided to return to the practice of using a VCS due to the data science capstone project's growth. I have repos here and there, but somehow they just don't want to perfectly fit my needs (there's either size limit, lack of privacy or I'd have to pay).

I already believed I needed a local VCS since on longer tube journeys and flights I may not have internet access.

I supposed I'd choose git as I was hoping to deepen my experience with it, so that I can get more effective with the online hosted version as well - just about everything seems to happen on GitHub these days.

The client


I first checked out the clients - at least those available on Linux. I so far haven't come upon those feature rich UI's that I was used to on Windows, but I still demand something similar, so it was crucial (those tiny little annoyances which often go unnoticed, can really slow down work).

Formerly I didn't have much luck when choosing the weapons for Subversion, back then I couldn't find a *nix match for the handy TortoiseSVN. I have RapidSVN but the version I have is exhaustingly bogus.

As git I assumed is trendier, I had better hopes for better maintained clients. I found the default range on this page, and with my criteria in mind (free, open-source, working, preferably cross-platform) this quickly narrowed down to git-cola.

It appears good enough (although it froze a couple of times while authenticating with the server in my tests), however the aptitude version is way outdated (or at least its about dialog, pretending to be from 2012), so I had to uninstall that one (apt-get remove git-cola), and install it from source.

It is also written in python, which I'm learning as time allows, so the sources might give me a lesson or two on that.

The "server"


While some git components, especially the CLI clients should have already been present on my system, I checked the official installation steps and kicked off with a further

sudo apt-get install git-all

just to be sure ... further from that, after a little thinking and experimenting, it turns out that the client tools with a local repository are probably all I needed to get going with my local project.

... and it works!


I only had to initialize a new repo, and it was working straight ahead.
I mean, after untarring, it already works with

~/Downloads/git-cola-2.5/bin/git-cola

if someone enjoys inconvenience. I decided to call it a day - TBC.

-----

And as a little perk for the night, I accidentally (was queued for later in preference over some Netflix whatnot...) watched a TED talk with Linus Torvalds which surprised me with mentioning his share of creating Git as well as Linux. Nicely done!

Update

make prefix=/usr install

executed from within the git-cola directory (~/Downloads/git-cola-2.5 according to the above) does the job, so that git-cola can be now launched with ease from the terminal, and also gets available from the "Programming" menu group.

Sorry about the formatting, at this point I'm lazy.

Wednesday, 6 April 2016

GitHub and R

I've been to a workshop preceding a meetup where the basics of how GitHub can be conveniently used with R Studio was the topic.

One benefit: I finally got my hands a little dirty with creating GitHub issues (although to date I tried to avoid that ... partly because I only ever had public repositories and possibly I didn't want to share every everything, especially not exemplary beginner's mistakes :) ).

Another one: Being reminded of continuous integration. One day I'm sure it'll come useful. The only problem with it is it didn't fit into the workshop, but here are the related slides for that one day:

http://mangothecat.github.io/github-workshop/05-best-practices.html#continuous-integration

And, say, the third: Catching sight of the "git" tab in the R Studio GUI, once that's sufficiently enabled for the project. Again, one fine day ...

Monday, 7 March 2016

R: the window diff exercise

The implementation of moving/sliding window transformations is one of the basic duties one may come across while developing various monitoring, time series analysis or signal processing tools. Third party libraries probably embody solutions, but it's a classic so I'll devote an entry to it.

An example could be calculating moving average over a time series. This pattern popped its head up again at an R coding dojo (in about 2013 ... that's how fast I write :) ). The concrete scenario has something to do with the fingerprinting that some antivirus software perform in the heuristic search for high-entropy chunks in a data stream (file), which if found, suggests encryption was applied, possibly by a hiding virus (Derek's Shape Of Code entry has more about it).

Moving away from his original topic - where the brevity of R does a better job than a lengthier code, it appeared to me this could be a nice example to illustrate a performance optimization concept which can come handy at other times, certainly in some more performance-oriented scenario.

Notations


n: number of samples (say ~ 1M)
m: window size (say ~ 1k)
k: a numerical constant (the point is it is independent from the parameters)

The original approach


cost ~ n * k * m

Thinking of the aforementioned R code, that one browses through all possible positions for time windows of a certain size, and computes the ratio of uniquely occurring symbols to the size of the time window for each. The excerpt I'll focus on:

This calculation can become massive, as its cost (in that form) is proportional with the window size multiplied by roughly the number of window positions - sapply() executes for each relevant X, and calls unique() with a slice of the input data. The procedure examines most input symbols in the stream window.size times. Is it necessary? (Some relief is that the data is recalled from the CPU cache already, but with a little thinking we can come up with something better.)

Although R may not be the most efficient environment for running such code (as loops are preferred to be replaced with array operations, which could allow even for implicit parallelization), I will show how it is possible to reduce the complexity of the algorithm from somewhat an n * m to somewhat of an n + m (asymptotically) by applying a simple way of thinking twice (and I can tell that count.positives() accounts for the multiplier of "m").

Remark: actually, the original considers only each 5th one, but I'll take another route here, and so will ignore this.

First improvement


Assuming instead that we have knowledge of the first frame, it is possible to only take into account the bits that leave the window and those that move into it, namely the symbol that gets excluded on the left, and the one that gets included on the right. So, by focusing on the right items, the computational complexity (however, not in an asymptotic sense, yet) can be reduced.

cost ~ n * k2 * m + m * k3, main point is k2 < k (expected)


The frequency[] array holds occurrence counters inside the window. Once initialized, it is only maintained at a low(er) cost per window, hence the memory read operations become less redundant - although each sample is accessed twice.

Second improvement


Now one might go back in thinking to that the subject examined is basically the number of those symbol counters which are non-zeroes. When changing a counter it may either enter that set, or leave it (i.e. change to zero). Therefore, once we have a starting point (described the first window) it is sufficient to focus our attention solely on the values which significantly change - were positive and are becoming zeroes (meaning the related symbol is no longer present in the new window), or were zeroes and are becoming positive (meaning the related symbol appears in the new window).

So similarly:


cost = O(n + m), m for initialization, n for each step

This now has the promised characteristics.

As a note, when sliding a window over floating point values, e.g. for the moving average calculation, one should note that a value may have an impact of a different precision when entering and exiting the window. This may lead to accumulating numeric errors, and in the end, have a significant impact on the value computed for the last window position. A similar problem emerges if the data changes in the magnitude of variation over certain sufficiently lengthy intervals. But the above are quantized values - integers on each level of abstraction, so no rounding takes place, and this implementation caveat is naturally avoided.

At the finish line


The output of the code on my laptop is as below:

   Original  First imp. Second imp.
      5.846       3.443       2.051


Apparently, each iteration improved a little on the performance, over 2.5x in the end, although this number is not relevant. A true maximalist could go on for an rpp implementation, making it really fast.
Part of the reasons is that such an approach could further benefit from the unfair advantage of a C++ compiled parameter passing, which the first code cannot utilize (however, that one does have/tries to leverage the advantage of vector calculations to some extent).
Yet, the above is rather a demonstration of the concept than an über-optimization attempt, so no C++ comparison for now. The lucky thing (due to there are many constants tampering with the end results) is that even if the two improvements work with for loops, they prove faster in 'at least one' realistic run.


Closing words: this entry was partly written to exercise my writing skills, also as a follow up from years ago, however little significance it might be of, I liked creating it. The code somehow reminds me of code dojo exercises. (Not sure if this pattern could make it to be a Code Kata I heard about a good while ago?  One more thing I'll have to check out anyway, at least.)


To those who made it this far: thanks for reading!


The complete R code is available here.