Sunday, 15 May 2016

Animations in R (part 1)

A sometimes forgotten feature of R is animation. It can add one more level to visualizations, although it can't compete with properly dynamic pages. Whenever generating each frame on the fly would be too demanding, however, as is frequently the case with a prototype or a one-off, less performance-optimized algorithm, it can still be the way to go.

#1

Just to remind myself that it's possible, here's a sine plotter clone as an animated GIF:
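(A sketch of how such a GIF can be produced in R, assuming the animation package, which drives ImageMagick under the hood; the plotting code below is a reconstruction, not the original script:)

    # render a short, periodic animation frame by frame and let
    # animation::saveGIF stitch the frames into a GIF via ImageMagick.
    library(animation)

    saveGIF({
      n_frames <- 60
      t <- seq(0, 2 * pi, length.out = 200)
      for (i in seq_len(n_frames)) {
        phase <- 2 * pi * i / n_frames   # one full cycle over the GIF => seamless looping
        plot(t, sin(t + phase), type = "l", ylim = c(-1, 1),
             xlab = "", ylab = "", main = "sine plotter")
      }
    }, movie.name = "sine.gif", interval = 0.05)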

Obviously, when created for the web it needs to be periodic within a few hundred frames; without that implied constraint it can be more flexible. By the way, it's originally a microprogramming classic (to me, at least), e.g. this one was an example. Beyond that, its history goes way back; see Lissajous curves, for instance.


#2

Alas, this one's a video and I have little control over its positioning, but here we can see how an xgboost forest's predictions respond to more iterations (= new trees) on a binary classification problem with a noisy training set, and how much of the problematic overfitting can be prevented by reducing the maximum tree depth. Of course, it's only overfitting as long as those stand-alone points, constituting a few percent of the whole training set, are really noise and not valuable information! Apparently, the maximum depth acts like some sort of filtering parameter. Below, I show the trained model re-evaluated on the training set for depths of 1, 2, 3, and 4.
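The experiment behind it is roughly this kind of loop (a sketch: the toy data set and the few-percent label noise are made up for illustration, only the parameter names are xgboost's real ones; in the video, the re-evaluation is simply rendered frame by frame as the number of rounds grows):

    # sketch of the experiment: noisy binary labels, a sweep over max_depth,
    # re-evaluating the fitted model on the training set itself.
    library(xgboost)

    set.seed(42)
    n <- 2000
    x <- matrix(runif(n * 2), ncol = 2)                 # two features
    y_true <- as.numeric(x[, 1] > 0.5)                  # the "real" rule
    y <- ifelse(runif(n) < 0.03, 1 - y_true, y_true)    # ~3% flipped labels = noise
    dtrain <- xgb.DMatrix(data = x, label = y)

    for (depth in 1:4) {
      fit <- xgb.train(params = list(objective = "binary:logistic",
                                     max_depth = depth, eta = 0.3),
                       data = dtrain, nrounds = 100)
      pred <- predict(fit, dtrain)                      # training-set predictions
      cat("max_depth =", depth,
          "  training error:", mean((pred > 0.5) != y), "\n")
    }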

Obviously, someone with a slightly mathematical mindset could then ask: why don't we allow fractional depth values? After all, that's the kind of parameterization the most common image processing filters offer - sharpness, blur amount, etc. - the convention is too common for its absence here not to feel like a limitation.
My thought there is that max.depth accomplishes more than one thing, namely at least: 1. it sets an expectation of what can effectively be considered a significant cluster of true values; 2. it controls the complexity of the rules (like if (v_i_1 and v_i_2 and ... and v_i_max.depth) return(1)). This can be both good and bad, but a) some separation of the two roles is worth a thought, e.g. filtering as a preprocessing step, and b) in practice the preferred values may change by location, so it can become a very long story...

... and that's the intro for now; I have some plans to return to this one in the future:





[To be continued ...]

Sources:
#1
#2

Wednesday, 11 May 2016

Yesterday's R dojo

I went to an R code dojo meetup yesterday. These meetups let people try out & practice things in smaller groups, typically of 4-6. I tend to value any practical format over the classic presentation + Q&A ones, as I've been told more than once that frontal education shows the poorest performance - so whatever its price, avoid it like the plague. Prefer events where practice is involved!

QuickCheck - another testing method/philosophy


First of all, I wanted to delve into QuickCheck, which I had heard about years ago. To me it leans more to the integration testing side, but it's still a decent one (actually, what isn't a bit of integration testing?).

I believe the stochastic behaviour of many ML routines almost mandates having repeated tests - a single test case being passed is not convincing on its own, just as no one really trusts an algorithm that forecasts well once (but perhaps never again). And this is just one side of the coin - the output normally cannot be known exactly, but approximate values, with error bands, are typically acceptable (actually, I expect this could go as far as putting proper statistical testing in place). Combine that with an output of several values, and 'narrow' rules of self-consistency, instead of an in-depth byte-by-byte specification of the output, become an appealing option.
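To make the "error band" idea concrete, here's the kind of assertion I mean, as a plain R sketch independent of any testing package (the four-standard-error band is an arbitrary choice of mine):

    # the sample mean of n draws should land within a generous band around
    # the true mean on virtually every run; repeating the check many times
    # is what gives it its convincing power.
    check_mean <- function(n = 1000, mu = 0, sigma = 1) {
      x <- rnorm(n, mu, sigma)
      abs(mean(x) - mu) < 4 * sigma / sqrt(n)
    }
    all(replicate(100, check_mean()))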

On the inbound train I got to installing it
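(Roughly along these lines; the subdir = "pkg" argument is just how I remember the repository being laid out, so the README there is the authoritative source:)

    # install quickcheck straight from GitHub; requires the devtools package.
    # the subdir argument is an assumption about the repo layout - check the README.
    library(devtools)
    install_github("RevolutionAnalytics/quickcheck", subdir = "pkg")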

and finding the steps at https://github.com/RevolutionAnalytics/quickcheck, then doing a

gave 2 more examples.
The examples are like the following (the original snippet came from the quickcheck package vignette; what's below is a reconstruction from memory):
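test() takes an assertion function whose default arguments are random data generators such as rdouble(), and then runs the assertion on many freshly drawn inputs:

    library(quickcheck)
    # property: reversing a vector twice gives back the original vector.
    # test() evaluates this on a batch of randomly generated double vectors.
    test(function(x = rdouble()) identical(x, rev(rev(x))))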

Unary function, unary output. What would have been a little more interesting to me was defining a bivariate function - since it operates on a much larger input space (+1 dimension), where random sampling becomes one reasonable approach to testing somewhat evenly over the volumetrically exploding set of input states.

I had exactly 0 luck there:
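Roughly what I tried (reconstructed; apparently each rdouble() call draws a vector of its own random length, so x and y rarely match, hence the recycling below):

    # a two-argument assertion: elementwise commutativity of addition.
    test(function(x = rdouble(), y = rdouble()) all(y + x == x + y))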

Some - at first incomprehensible - recycling occurs and I'm flooded with warnings like:

9: In y + x :
  longer object length is not a multiple of shorter object length

Sure it could be implemented as two embedded test() calls, but I assume there's a nicer, terser way.

Anyhow, I'll have to postpone stuff again ... somehow (no surprise there) I was the only person interested in this thing :)

PowerBI


So I found myself in a group of lads not so much interested in QuickCheck as in Microsoft's new BI platform - thanks to Marczin for bringing this up.
The "big thing" here was that it can work with R, so that data processing pipelines get combined with R's strengths in statistical analysis and data visualization.

Disappointingly, this is for Windows users - at least the desktop version is. As I'm not much of a business user, I was sort of turned away when I wanted to run things online... all right, the more of me remains for everything else then :) (everything else applauding)

We did manage to put things together. After dragging the fields of the automagically linked tables onto the inserted R visual's "Values" list and writing a tiny script referencing them in dataset$field style, R charts appeared. Actually, it wasn't nearly as intuitive as MS Access (2000/XP) was ages ago - I mean, putting the data flow together took some time to work out, and I also had some redundancy in one of my CSVs, so we wasted some time on that, but it finally worked.
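The script itself was just a few lines of base R, roughly this shape (the dataset data frame is what Power BI hands to the R visual; the field names here are made up for illustration):

    # Power BI exposes the fields placed in the R visual's Values list
    # as columns of a data.frame named `dataset`; the column names below
    # are illustrative only.
    plot(dataset$month, dataset$amount,
         type = "b", xlab = "month", ylab = "amount")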

Microsoft R Open


What is more relevant is that it relies on Microsoft R Open, which in turn does work on various platforms. That I'll surely have to take a closer look at.
I tend to forget what I have seen: xgboost, for instance, comes to my mind as something written in a low-level way, e.g. with C or C++ at its heart. However, if that isn't quite the case (for instance, if matrix operations are still left to R's native routines), then R Open's capabilities could speed things up. It appears I'll have to see that for myself ...
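A simple way to check whether that pays off on a given machine is to time something BLAS-heavy under both the stock R and the R Open builds; a minimal sketch:

    # time a large matrix multiplication - run the same snippet under plain R
    # and under Microsoft R Open and compare the elapsed times.
    set.seed(1)
    n <- 2000
    a <- matrix(rnorm(n * n), n)
    b <- matrix(rnorm(n * n), n)
    system.time(a %*% b)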