Monday, 7 March 2016

R: the window diff exercise

Implementing moving/sliding window transformations is one of the basic tasks one may come across while developing various monitoring, time series analysis or signal processing tools. Third-party libraries probably already provide solutions, but it's a classic, so I'll devote an entry to it.

An example could be calculating a moving average over a time series. This pattern popped its head up again at an R coding dojo (in about 2013 ... that's how fast I write :) ). The concrete scenario has something to do with the fingerprinting that some antivirus software performs in its heuristic search for high-entropy chunks in a data stream (file); such chunks, if found, suggest that encryption was applied, possibly by a virus hiding itself (Derek's Shape Of Code entry has more about it).

Moving away from his original topic - where the brevity of R does a better job than lengthier code would - it appeared to me that this could be a nice example to illustrate a performance optimization concept which can come in handy at other times, certainly in more performance-oriented scenarios.

Notations


n: number of samples (say ~ 1M)
m: window size (say ~ 1k)
k: a numerical constant (the point is that it is independent of the parameters above)

The original approach


cost ~ n * k * m

Thinking of the aforementioned R code: it browses through all possible positions of a time window of a certain size, and for each position computes the ratio of uniquely occurring symbols to the window size. The excerpt I'll focus on:
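Rather than quoting it verbatim, here is a minimal sketch of that style of computation (assumed names: stream for the input vector of symbols, window.size for the window length), just to make the cost structure concrete:

# Count the distinct symbols at every window position by re-examining
# the full slice each time - the cost grows with n * m.
unique.ratio.naive <- function(stream, window.size) {
  positions <- 1:(length(stream) - window.size + 1)
  sapply(positions, function(x) {
    length(unique(stream[x:(x + window.size - 1)])) / window.size
  })
}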

This calculation can become massive, as its cost (in that form) is proportional to the window size multiplied by roughly the number of window positions - sapply() executes for each relevant X, and calls unique() with a slice of the input data. The procedure examines most input symbols in the stream window.size times. Is that necessary? (Some relief is that the data is likely recalled from the CPU cache, but with a little thinking we can come up with something better.)

Although R may not be the most efficient environment for running such code (loops are best replaced with array operations, which may even allow for implicit parallelization), I will show how it is possible to reduce the complexity of the algorithm from roughly n * m to roughly n + m (asymptotically) by applying a simple way of thinking twice (and I can tell that count.positives() accounts for the multiplier of "m").

Remark: the original actually considers only every 5th window position, but I'll take another route here, so I will ignore this.

First improvement


Assuming instead that we have knowledge of the first window, it is possible to take into account only the items that leave the window and those that move into it, namely the symbol that gets excluded on the left and the one that gets included on the right. So, by focusing on the right items, the computational cost (though not yet in an asymptotic sense) can be reduced.

cost ~ n * k2 * m + m * k3, the main point being that k2 < k (expected)


The frequency[] array holds occurrence counters for the symbols inside the window. Once initialized, it only needs to be maintained at a low(er) cost per window position, hence the memory read operations become less redundant - although each sample is still accessed twice (once when entering and once when leaving).
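A sketch of this variant (again my own reconstruction, not the original code; it assumes the symbols are small non-negative integers so that they can index a frequency table, and it keeps a per-step count.positives()-style scan over that table, which plays the role of the remaining m-like factor in the cost formula above):

count.positives <- function(v) sum(v > 0)

# Maintain per-symbol counters; on each slide only two counters change,
# but the number of distinct symbols is still recounted by scanning the table.
unique.ratio.freq <- function(stream, window.size,
                              alphabet.size = max(stream) + 1) {
  n <- length(stream)
  frequency <- integer(alphabet.size)
  for (i in 1:window.size) {                  # first window: cost ~ m
    s <- stream[i] + 1                        # +1: R vectors are 1-based
    frequency[s] <- frequency[s] + 1
  }
  result <- numeric(n - window.size + 1)
  result[1] <- count.positives(frequency) / window.size
  for (x in 2:(n - window.size + 1)) {
    out <- stream[x - 1] + 1                  # symbol leaving on the left
    inc <- stream[x + window.size - 1] + 1    # symbol entering on the right
    frequency[out] <- frequency[out] - 1
    frequency[inc] <- frequency[inc] + 1
    result[x] <- count.positives(frequency) / window.size
  }
  result
}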

Second improvement


Now one might take a step back and notice that the quantity examined is basically the number of symbol counters that are non-zero. When a counter changes, it may either enter that set or leave it (i.e. drop to zero). Therefore, once we have a starting point (the first window described), it is sufficient to focus our attention solely on the counters that change significantly: those that were positive and become zero (meaning the related symbol is no longer present in the new window), and those that were zero and become positive (meaning the related symbol appears in the new window).
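A sketch along those lines (same assumptions and hypothetical names as before; a running count of non-zero counters is kept and only adjusted on zero-to-positive and positive-to-zero transitions):

# Keep a running count of non-zero counters; each slide is constant work.
unique.ratio.incremental <- function(stream, window.size,
                                     alphabet.size = max(stream) + 1) {
  n <- length(stream)
  frequency <- integer(alphabet.size)
  distinct <- 0
  for (i in 1:window.size) {                  # first window: cost ~ m
    s <- stream[i] + 1
    if (frequency[s] == 0) distinct <- distinct + 1
    frequency[s] <- frequency[s] + 1
  }
  result <- numeric(n - window.size + 1)
  result[1] <- distinct / window.size
  for (x in 2:(n - window.size + 1)) {        # each step: constant work
    out <- stream[x - 1] + 1
    inc <- stream[x + window.size - 1] + 1
    frequency[out] <- frequency[out] - 1
    if (frequency[out] == 0) distinct <- distinct - 1  # symbol left the window
    if (frequency[inc] == 0) distinct <- distinct + 1  # symbol (re)appeared
    frequency[inc] <- frequency[inc] + 1
    result[x] <- distinct / window.size
  }
  result
}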

So similarly:


cost = O(n + m): m for the initialization, plus constant work for each of the ~n subsequent steps

This now has the promised characteristics.

A side note: when sliding a window over floating-point values, e.g. for a moving average calculation, a value may contribute with a different rounding error when entering than when leaving the window. This can lead to accumulating numeric errors and, in the end, have a significant impact on the value computed for the last window positions. A similar problem emerges if the magnitude of the data's variation changes over sufficiently lengthy intervals. The values here, however, are quantized - integers on each level of abstraction - so no rounding takes place, and this implementation caveat is naturally avoided. A toy illustration of the caveat follows below.
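(This snippet only illustrates the floating-point caveat, it is not part of the exercise: a moving sum maintained incrementally can drift away from the value obtained by recomputing the window from scratch.)

set.seed(1)
values <- rnorm(1e5) * 10^runif(1e5, -5, 5)   # values of wildly varying magnitude
w <- 1000
running <- sum(values[1:w])
for (i in 2:(length(values) - w + 1)) {
  running <- running - values[i - 1] + values[i + w - 1]   # incremental update
}
exact <- sum(values[(length(values) - w + 1):length(values)])  # last window, recomputed
abs(running - exact)   # accumulated rounding error; typically non-zero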

At the finish line


The output of the code on my laptop is as below:

   Original  First imp. Second imp.
      5.846       3.443       2.051
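
(For completeness, such timings could be gathered along the following lines; this uses the hypothetical function names from the sketches above and toy input sizes, so it is not the original harness and will not reproduce the exact numbers.)

set.seed(42)
stream <- sample(0:255, 1e5, replace = TRUE)   # byte-like symbols
window.size <- 1000

system.time(unique.ratio.naive(stream, window.size))
system.time(unique.ratio.freq(stream, window.size))
system.time(unique.ratio.incremental(stream, window.size))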


Apparently, each iteration improved the performance a little, over 2.5x in total, although the exact number is not that relevant. A true maximalist could go on to an Rcpp implementation, making it really fast.
Part of the reason is that such an approach could further benefit from the unfair advantage of compiled C++ parameter passing, which the first code cannot utilize (although that one does try to leverage vectorized calculations to some extent).
Yet, the above is more a demonstration of the concept than an über-optimization attempt, so no C++ comparison for now. The lucky thing (since there are many constants influencing the end result) is that even though the two improvements use for loops, they prove faster in at least one realistic run.


Closing words: this entry was partly written to exercise my writing skills, and also as a follow-up from years ago; however little significance it may have, I enjoyed creating it. The code somehow reminds me of coding dojo exercises. (Not sure whether this pattern could make it into a Code Kata, something I heard about a good while ago - one more thing I'll have to check out anyway.)


To those who made it this far: thanks for reading!


The complete R code is available here.