Saturday, 24 December 2016

Tools are not (just) tools

Note to self

I asked someone today whether he still likes maths, and got the response that maths is just a tool.

It made me think whether tools are actually just tools, or slightly more than that.

Coincidentally, I was about to study some innovation management, which reminded me of adoption curves.

The Rogers adoption curve starts with the innovators. These people are the first to get acquainted with an innovation, use a new gadget, try out a new service, etc. One of their characteristics is trying things out l'art pour l'art, meaning they don't care about the practical value of those attempts.
They just toy with them.

I had two things to note:

1. these are the people who occupy themselves with things without a particular purpose

2. this is very much an analogue of how ADD people (in my imagination) get distracted by literally anything

About 1.


Actually, what purpose do we have in mind as kids, playing with things? Usually some made-up one drives the play, I think, but it can be as minimally targeted as 'joy', 'fun', etc.
Sounds like a good idea to me.

And just because they have the "thing" around, they can play around with it. The presence of the item allows for another degree of freedom.
And people, like a gas, fill the space they are given.

So the new toy gets played with, new routine gets added to what was there before.

Tools, as toys, become doors; maths is a door too. Doors get entered, so maths will likely become part of your future path once you've played with it.

About 2.


This is just a corollary - I would guess that being overloaded with options, just like being overloaded with topics on the internet, has created its own group of addicts. Innovators are the ADD of the market.
Positively mad people. See 1. - by allowing themselves options, they allow for creativity, by going against discipline.

Now is a good time to ask - is discipline good or bad then?
Obviously, it depends.

Discipline and minimalism are two sides of the same coin.
Capitalism and competition have pushed us perhaps too far towards the minimalist, efficiency side.

You'll choose the offer for 999 coins over the one for 1000, unless the cheaper one is noticeably worse.
Only the resolution of your value perception needs to be tricked, and the quality rot begins: a worse product sold for almost as high a price as the better one.

Discipline is a tool, not an objective. It is also an option, a freedom to choose. A tool - but once it's not a must, once you've played with it without needing it, it becomes more than a tool, too.

You can design bottom-up as well as top-down, so you can create something without knowing the final design - simply because you didn't slap yourself back into line before you could have found another path.

We shouldn't forget to embrace freedom. And ADD makes people less controllable - a good thing, at times.

Thursday, 15 December 2016

Is education going out of fashion?

Apparently people have finally found their way to the schools ... or what's going on? They don't seem to search for these institutions that much anymore.



It could be interesting to find out what alternative terms people have started to search for.

MOOCs are a very natural first thought. And they have clearly gained interest over time...



Are they all that significant?



Indeed.

So perhaps the above does not mean education is going out of fashion; maybe, as expected, it is actually coming to play a more and more significant role - but who knows.

Interest in academic degrees seems to have nearly stagnated, although there is a likely shift towards shorter degrees:

Sunday, 16 October 2016

Idle priority: data project contributions

I will mention two here, so that I won't forget them. One is new and sounds more interesting (of course - new things, always ...)

Amnesty Decoders (update: false trail...)

"Join a global network of digital volunteers helping us research and expose human rights violations."


This is actually something that deceived me big time:


They want you! To click on sections of satellite imagery where villages (artificial structures) are present. Hm... I would have thought people (with some proficiency) can contribute to the image analysis with code :(
I wonder why these projects don't end up on Kaggle. Or do they?
This task is absolutely crying out for automation ...



... and another one, which I actually started doing something with, from the past:

Open Corporates



It is a large database, available commercially to companies - that's how they make their living. It is backed by paid coders as well as enthusiasts (they do/did provide free means of accessing the data as well, especially for contributors); their meetup, if someone wants a bit of hacking for good, is at:


I never found the time again, so my bot is still 'booting' - but it would have been nice to finish it. I should.

The task is to create bots, in Python, which do the web scraping (in my days it happened with BeautifulSoup) and crack the data open (through multiple steps) to extract the relevant attributes.

Their stuff is helpful for those doing data journalism, and there are hopes it can help cut back or even eliminate corruption hidden behind ownership graphs.

Friday, 7 October 2016

Why or why not Go [draft]

Gnack, language is a matter of choice, but the choice is a matter of the market, etc.; so this year as usual:
-- Away from me, Pascal!

So what language?


It is starting to become apparent that some companies do choose newer technologies despite the safe-playing masses, although it's not always obvious when looking at the overall trend. A workforce performance increase (I mean code * quality / dev_hours, or something similar)? I wonder. Secrets, never told.

One example: I've been trying to weigh the reasons for and against learning Go, as I'm not a massive C++ fan; but I know I like Python, and after a quick assessment Go didn't seem to add a lot over it. Still, I doubted my own judgement (being a master of neither of the two), so I looked at Google Trends and didn't find anything promising. Even Google didn't market it too well :)



However, just today a lecture reminded me of ITJobsWatch's charts - and see these UK-wide statistics! (The numbers at the front are approximately correct as of 8/10/2016.)

1%: Go on ITJobsWatch

That's a small breakout :) Mind you, my earlier investments, R and Python, have gone way, way up too.

1.5%: R on ITJobsWatch

14%: Python on ITJobsWatch

This is how it works! Or rather when it works.

Then add that Docker is written in Go, and that CloudFoundry, RabbitMQ and other things people write for other people to use have started to recognize the power of this language, and the picture gets much nicer.

UPDATE:
On Quora they found that the search term "Golang" shows a nice, steep upward Google trend. I'd add that this is not guaranteed to be a sign of success: as people realize that this expression works better in searches, the apparent explosion may just be a manifestation of search volume being taken over from other, related expressions (a sort of cannibalisation). To date, "Golang" is backed by a fraction of the searches for "Go (programming language)" and "Google go".

The Bug #1: not that much love for Windows from Go


However, there's this little bug, which (at the time of writing) still prevents building Windows DLLs...

So it doesn't quite look like a "platform-neutral" development attempt; rather something whose brand "accidentally" starts like Google's, and which doesn't support full-scale Windows development. Hm :) Me? Not suspecting a thing. Good question: who knows when that will change?

TODO: a plot of time vs. number of comments (or total length of comments) on the issue would be nice :) When the hell is it going to get closed?

Update: The Bug #2 - unfriendly again, Python would love GoLang if...


So, as "Bug #1" says, you'll have to think twice before building Windows DLLs with Go. From the epic discussion on the bug page it seems it only affects multi-threaded code, via the Windows TLS (thread-local storage) support. However, something like that really should work ... especially in Go, which prides itself on its multi-threading support.

So (perhaps without noticing why) on SO people compromise on building .so files for Linux:
http://stackoverflow.com/questions/12443203/writing-a-python-extension-in-go-golang

But who really wants to create a *nix-only Python package, or one that may not be extensible at some point - beyond internal use, that is? (Think of Anaconda - I guess it's a blocker for more official Python distributions.)

Then there's gopy, also mentioned there, which makes importing trivial. However, it's still not compatible with Go >= 1.6 - even though it would be a very good way for Go to creep into commonplace use as an extension language first.

And Linux is still not even nearly everything.

So it seems there's a while longer to wait for the ecosystem to get ready for broader market penetration. Exciting times anyway.

Surely more promising already on the server side!

TODO: Mention fun language - intrinsic motivation - creativity association, weakness in analytics.
TODO: Mention recently created Go on GitHub charts? Create new ones?

Thursday, 6 October 2016

An entropy paradox

Entropy/diversity is one of my favourite brain-wasting topics (I have my reasons for that, good ones :) ). Here's a short line of thought that illustrates the tricky nature of these notions. At the end I'll probably not connect it back to the naive everyday thinking everyone takes for granted, as that would be either embarrassing for many, or worse (in case I'm wrong), very embarrassing for me alone :) But beware, I think I could (make a fool of myself)...

So the opening thought is that when we are young, we are similarly close to both of our parents, and as we grow older, we increasingly exhibit traits of the same-gender parent, i.e. we specialize.

Now let's describe this with something that behaves like the Herfindahl index (any entropy measure does the job in one way or another) over a pair of similarity metrics. We'll find that as the pair (mom_similarity, dad_similarity) moves from somewhere near (0.5, 0.5) towards either (1.0, 0.0) or (0.0, 1.0), the index increases, suggesting a less diverse individual. (With H = mom_similarity^2 + dad_similarity^2, the pair (0.5, 0.5) gives 0.5, while (1.0, 0.0) gives 1.0.)

Now let's look at this on the macro level: multiply up the aforementioned individual so that it becomes a population (of roughly identically aged, random-gender people, which is growing up)!

From a series similar to [(0.5, 0.5), ..., (0.5, 0.5)] we observe a transition towards (assuming equal probabilities for the genders) a series that is more like [(1.0, 0.0), (0.0, 1.0), ...] etc.

What we then find is that diversity on the macro level did exactly the opposite - a population of randomly grown-up people is more diverse than one of babies. Actually, that was quite trivial even without the numbers, but the joy, and the words with the weird spelling ... :)

So yes, almost paradoxically, micro and macro level entropy may work against each other - the level of abstraction does matter a lot!
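If I wanted to convince myself in a few lines of R (a toy sketch over made-up populations - only the Herfindahl formula itself is standard):

herfindahl <- function(p) sum(p^2)    # higher value = more concentrated = less diverse

babies <- matrix(0.5, nrow = 100, ncol = 2)              # everyone sits at (0.5, 0.5)
adults <- rbind(matrix(c(1, 0), 50, 2, byrow = TRUE),    # half specialize one way ...
                matrix(c(0, 1), 50, 2, byrow = TRUE))    # ... half the other way

# Micro level: the average individual becomes more concentrated (less diverse)
mean(apply(babies, 1, herfindahl))   # 0.5
mean(apply(adults, 1, herfindahl))   # 1.0

# Macro level: concentration over the distinct "types" present in the population
# drops, i.e. the population as a whole becomes more diverse
type_shares <- function(pop) as.numeric(table(apply(pop, 1, paste, collapse = ","))) / nrow(pop)
herfindahl(type_shares(babies))      # 1.0 (a single type)
herfindahl(type_shares(adults))      # 0.5 (two types, evenly split)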



P.S.: Don't calm down. I look forward to distributing similarly useless thoughts in the future.

Tuesday, 28 June 2016

R: a quick permutation sightseeing

Listing the permutations of a number of distinct elements is surely doable, as long as the number of elements is not too big. But with only a little pickiness, such an innocent task can raise many interesting (if small) challenges once everyday expectations (memory, heap and CPU efficiency) are to be respected.

I found myself facing this task - needing the list of permutations of 5 elements (related to the Draper Satellite Image Chronology competition on Kaggle) - and while it's truly simple, I thought I would have a peek at how far common sense has progressed with it.

Answers to a StackOverflow question should tell something about it.

There were
  1. the brute force exhaustive search approach, probably the simplest of all (which is probably the most suited for my original motivation)
  2. the classic recursive implementation
  3. Museful's much more vectorized ("matrixized") solution, a variant of which had also crossed my mind (unfortunately the SO one is just as good, if not better, than mine would have been, so I won't waste many words on it)
and some others which seemed conceptually equivalent to either of the above.

The problem was pretty much solved, but my brain kept ticking... so there is one more I would like to take a note of here.

Firstly, I did deal with this before, a very long time ago, and had a solution which I annoyingly forgot, but it was easy to recreate something similar.
The key concept was that (over the same elements) one permutation only differs from another by (a number of) exchanges between pairs of elements.

Actually, that's almost too easy to see: by repeatedly exchanging a misplaced element with the element that belongs in its position, we get there in no more than n exchanges (actually n - 1 is a sharper bound, and the average minimum number of steps required is probably much lower).

E.g. have 12345 want 54321

exchange 1 & 5, get:

52341

exchange 2 & 4, get:

54321

done.

In other words, just by exchanging elements, we can go from one permutation to the other.

So this is probably an option for listing all the permutations just as well.

Going from 12345 to 12354 is a simple move: swap the last two. To get to the next meaningful step, a third element surely needs to be involved. Then we may get back to varying that pair again. Then the elements of this most frequently varying pair need some fresh blood again.

To see what I mean, the pair variations over a given 'symbol set' may be:
  1. 12345 and 12354
  2. 12435 and 12453
  3. 12534 and 12543
This gives the first 6 permutations. The generalisable recipe for the starting points above:
  1. don't do anything,
  2. replace the 3rd with one element from the pair,
  3. replace the 3rd with the other element from the pair.
The flow goes as:

This may still feel a bit unclear, but technically we get the permutations of n elements from those of (n - 1) elements by placing the new element into each of the n possible slots: in the first case it's not involved in the last (n - 1) positions at all, and then, in each "big" step, it substitutes a different one of those (n - 1) elements. The last (n - 1) positions are just shuffled around as usual.

Slightly modifying it (working in reverse order) for ease of coding, we get an implementation along the following lines.
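(A sketch rather than my exact code: this is the iterative form of Heap's algorithm, which shares the properties I cared about - each new permutation arises from a single pair exchange, and the momentary state fits in the permutation plus n counters, i.e. 2 * n integers - though its ordering differs from the scheme above.)

heap_permutations <- function(n) {
  p <- 1:n
  cnt <- integer(n)                    # per-position swap counters: the other n integers of state
  out <- matrix(0L, nrow = factorial(n), ncol = n)
  out[1, ] <- p
  row <- 1
  i <- 2
  while (i <= n) {
    if (cnt[i] < i - 1) {
      j <- if (i %% 2 == 1) 1 else cnt[i] + 1    # partner position for the exchange
      tmp <- p[j]; p[j] <- p[i]; p[i] <- tmp     # one pair exchange -> one new permutation
      row <- row + 1
      out[row, ] <- p
      cnt[i] <- cnt[i] + 1
      i <- 2
    } else {
      cnt[i] <- 0
      i <- i + 1
    }
  }
  out
}

head(heap_permutations(5))   # 120 rows in total, each differing from the previous by one swap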

I was really happy to sketch this out, although it's not much of an improvement. If I really want to be nice to myself, then it is: it does not have to be recursive and has a light heap footprint (the momentary state of the algorithm is stored in 2 * n integers), but otherwise it isn't too interesting, at least in R, where vectorization is much preferred over tons of short loops. It could possibly be a good choice in C(++) if someone's bored enough.

Then I Googled permutations with pair switches and got to a really interesting PDF.
Their solution assigns a 'momentum' (direction) to each number at each step, and I don't fully understand it (yet). Powers beyond my control :)
But it surely is a simple one, and it is more efficient in that it doesn't put items back where they came from, i.e. it only applies half of the switches. However, it does have to find the largest movable element at each step.

So do I win or what? :) Maybe only as long as I resist the temptation to read beyond the first page of their stuff... I'd say, for the temporary illusion, not today! Sure they have something better.. :)

Monday, 13 June 2016

R: the "is ordered" exercise

Some programming treats look nicer to me than others; one of them is determining whether a series of numbers is ordered one way or the other.

It can come in useful when trying to decipher some shuffled data set, which mostly strikes me on Kaggle. Although it could be done 'on the spot' - trivial solutions are easy to implement - it seems like an interesting exercise if vectorization is a requirement.

And I'm trying to collect such things here and keep my brain moving, so I created a solution which I'll give below.

The function

Not much (just proof-of-concept level), but tested: link to the test code. Obviously, if anyone has a better idea for the implementation, please give me a shout!
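For a flavour of the approach, a minimal sketch along the same lines (not the tested version linked above - it treats ties as still ordered, ignores NA handling, and is vectorized via diff() instead of a loop):

series_order <- function(x) {
  d <- diff(x)                    # consecutive differences, in one vectorized step
  if (all(d >= 0)) return("non-decreasing")
  if (all(d <= 0)) return("non-increasing")
  "unordered"
}

series_order(c(1, 2, 2, 5))   # "non-decreasing"
series_order(c(9, 4, 4, 1))   # "non-increasing"
series_order(c(1, 3, 2))      # "unordered"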

Sunday, 15 May 2016

Animations in R (part 1)

A sometimes forgotten feature in R is animation. It can add one more level to visualizations, although it can't compete with proper dynamic pages. Whenever generating each frame on the fly would be too demanding, however - as is frequently the case with a prototype or one-off, less performance-optimized algorithm - it can still be the way to go.

#1

Just to remind myself that it's possible, here's a sine plotter clone as an animated GIF:

Obviously, when created for the web it needs to be periodic within a few hundred frames; in the absence of that constraint it could be more flexible. By the way, it's originally a microprogramming classic (to me, at least), e.g. this one was an example. Beyond that, its history spans way back - see the Lissajous curves, for instance.
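(A minimal sketch of how such a GIF can be produced - assuming the 'animation' package and ImageMagick are available; this is not the original script behind the image above:)

library(animation)

saveGIF({
  n_frames <- 60
  x <- seq(0, 2 * pi, length.out = 200)
  for (i in seq_len(n_frames)) {
    phase <- 2 * pi * i / n_frames           # a full cycle over n_frames, so the GIF loops seamlessly
    plot(x, sin(x + phase), type = "l", ylim = c(-1, 1),
         xlab = "x", ylab = "sin(x + phase)")
  }
}, movie.name = "sine.gif", interval = 0.05)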

 

 

#2

Alas, this one's a video and I have little control over its positioning, but here we can see how an xgboost forest-based prediction responds to more iterations (= new trees) for a binary classification problem with a noisy training set, and how much of the problematic overfitting can be prevented by reducing the maximum tree depth. Of course, it's only overfitting as long as those stand-alone points, constituting a few percent of the whole training data, really are noise and not valuable information! Apparently, the max. depth acts like some sort of filtering parameter. I showcase the result of the training, re-evaluated over the training set, for depths of 1, 2, 3 and 4 below.
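(A rough sketch of the kind of experiment behind the video - on made-up noisy data rather than the original set; the parameter names follow the xgboost R package:)

library(xgboost)

set.seed(1)
n <- 2000
X <- matrix(runif(n * 2), ncol = 2)
y <- as.numeric(X[, 1] > 0.5)            # the true signal is a single split on the first feature
flip <- sample(n, n * 0.05)              # ~5% of the labels flipped: the "stand-alone" noise points
y[flip] <- 1 - y[flip]

for (d in 1:4) {
  fit <- xgboost(data = X, label = y, max_depth = d, nrounds = 50,
                 objective = "binary:logistic", verbose = 0)
  p <- predict(fit, X)                   # re-evaluated over the training set, as in the video
  cat("max_depth =", d, " training error:", mean((p > 0.5) != y), "\n")
}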

Obviously, someone with a slightly mathematical mindset could then ask: why don't we allow fractional depth values? After all, that's the parameterization style most common image-processing filters allow - sharpness, blur amount, etc.; the limitation is too common to miss.
My thought there is that max.depth accomplishes more than one thing: at the very least it 1. sets an expectation of what can effectively be considered a significant cluster of true values, and 2. controls the complexity of the rules (something like if (v_i_1 and v_i_2 and ... v_i_max.depth) return(1)). This can be both good and bad, but a) some separation of these roles is worth a thought, e.g. filtering as a preprocessing step, and b) in practice the preferred values may change by location, so it can be a very long story...

... and that's the intro for now; I have some plans to return to this one in the future:





[To be continued ...]

Sources:
#1
#2

Wednesday, 11 May 2016

Yesterday's R dojo

I went to an R code dojo meetup yesterday. These meetups let people try out and practice things in smaller groups, typically of 4-6. I tend to value any practical format over the classic presentation + Q&A one, as it has been communicated to me more than once that frontal education shows the poorest outcomes - so, despite its price: avoid it like the plague. Prefer events where practice is involved!

QuickCheck - another testing method/philosophy


First of all, I wanted to delve into QuickCheck, which I heard about years ago. To me it is more on the integration-testing side, but still a decent one (actually - what isn't a bit of integration testing?).

I believe the stochastic behaviour of many ML routines almost mandates repeated tests - a single passed test case is not convincing on its own, just as no one really trusts an algorithm that forecasts well once (and perhaps never again). And this is just one side of the coin: the output normally cannot be known exactly, but approximate values with error bands are typically acceptable (actually, I expect this could go as far as putting proper statistical testing in place). Combine that with an output of several values, and 'narrow' rules of self-consistency, instead of an in-depth byte-by-byte specification of the output, become an appealing option.

On the inbound train I got as far as installing it, finding the steps at https://github.com/RevolutionAnalytics/quickcheck; running one more thing then gave two more examples.
The examples are like (this one is from the quickcheck package vignette):
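(Quoting from memory rather than verbatim - the idea is that an assertion is simply a function whose arguments default to random data generators, such as rdouble():)

library(quickcheck)

# Property: reversing a vector twice gives back the original vector.
# rdouble() draws a fresh random double vector on each of the repeated runs.
test(function(x = rdouble()) identical(x, rev(rev(x))))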

A unary function, unary output. What would have been a little more interesting to me was defining a bivariate function - it operates on a much larger input space (+1 dimension), where random sampling becomes one reasonable approach to testing somewhat evenly over the volumetrically exploding set of input states.

I had exactly 0 luck there:
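(Roughly what I tried - again from memory, assuming the same assertion style with two independently generated arguments:)

# Commutativity of '+' as the property. Since each rdouble() picks its own random
# length, x and y usually differ in length, so '+' resorts to recycling - hence
# the flood of warnings below.
test(function(x = rdouble(), y = rdouble()) all(x + y == y + x))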

Some - at first incomprehensible - recycling occurs, and I'm flooded with warnings like:

9: In y + x :
  longer object length is not a multiple of shorter object length

Sure it could be implemented as two embedded test() calls, but I assume there's a nicer, terser way.

Anyhow, I'll have to postpone stuff again ... somehow (no surprise there) I was the only person interested in this thing :)

PowerBI


So I found myself in a group of lads interested not so much in QuickCheck as in Microsoft's new BI platform - thanks to Marczin for bringing this up.
The "big thing" here was that it can operate with R, so that data-processing streams get combined with R's powers of statistical analysis and data visualization.

Disappointingly, this is for Windows users - at least the desktop version is. And as I'm not much of a business user, I was sort of turned away when I wanted to run things online... all right, the more of me remains for everything else then :) (everything else applauding)

We did manage to put things together. After dragging the fields of the automagically linked tables onto the inserted R object's "values" list, and writing a tiny script referencing values in dataset$field style, R charts appeared. Actually, it wasn't nearly as intuitive as MS Access (2000/XP) was ages ago - I mean, putting the data flow together took some time to work out, and I also had some redundancy in one of my CSVs, so we wasted some time on that - but it finally worked.

Microsoft R Open


What is more relevant is that it relies on Microsoft R Open, which in turn works on various platforms. I'll surely have to give that a closer look.
I tend to forget what I've seen: xgboost, for instance, comes to mind as something written in a low-level way, e.g. with C or C++ at its heart. However, if that isn't quite the case (for instance, if matrix operations are still left to R's native routines), then R Open's capabilities could speed things up. It appears I'll have to see that for myself ...

Friday, 8 April 2016

Working on a longer coursera project locally on Lubuntu

Today I decided to return to the practice of using a VCS, due to the growth of the data science capstone project. I have repos here and there, but somehow they just don't perfectly fit my needs (there's either a size limit or a lack of privacy, or I'd have to pay).

I already believed I needed a local VCS, since on longer tube journeys and flights I may not have internet access.

I supposed I'd choose git as I was hoping to deepen my experience with it, so that I can get more effective with the online hosted version as well - just about everything seems to happen on GitHub these days.

The client


I first checked out the clients - at least those available on Linux. So far I haven't come across the feature-rich UIs I was used to on Windows, but I still want something similar, so this was crucial (those tiny little annoyances, which often go unnoticed, can really slow down work).

Formerly I didn't have much luck choosing my weapons for Subversion; back then I couldn't find a *nix match for the handy TortoiseSVN. I have RapidSVN, but the version I have is exhaustingly buggy.

As git, I assumed, is trendier, I had hopes for better-maintained clients. I found the default range on this page, and with my criteria in mind (free, open source, working, preferably cross-platform) the choice quickly narrowed down to git-cola.

It appears good enough (although it froze a couple of times while authenticating with the server in my tests); however, the aptitude version is way outdated (or at least its About dialog is, claiming to be from 2012), so I had to uninstall that one (apt-get remove git-cola) and install it from source.

It is also written in Python, which I'm learning as time allows, so the sources might give me a lesson or two on that.

The "server"


While some git components, especially the CLI client, should have already been present on my system, I checked the official installation steps and kicked off with a further

sudo apt-get install git-all

just to be sure ... beyond that, after a little thinking and experimenting, it turns out that the client tools with a local repository are probably all I need to get going with my local project.

... and it works!


I only had to initialize a new repo, and it was working straight away.
I mean, after untarring, it already works with

~/Downloads/git-cola-2.5/bin/git-cola

if someone enjoys inconvenience. I decided to call it a day - TBC.

-----

And as a little perk for the night, I semi-accidentally (it was queued for later, preferred over some Netflix whatnot...) watched a TED talk with Linus Torvalds, which surprised me by covering his share in creating Git as well as Linux. Nicely done!

Update

make prefix=/usr install

executed from within the git-cola directory (~/Downloads/git-cola-2.5 according to the above) does the job, so that git-cola can now be launched with ease from the terminal, and also becomes available from the "Programming" menu group.

Sorry about the formatting, at this point I'm lazy.

Wednesday, 6 April 2016

GitHub and R

I've been to a workshop preceding a meetup, where the topic was the basics of how GitHub can conveniently be used with RStudio.

One benefit: I finally got my hands a little dirty with creating GitHub issues (although to date I had tried to avoid that ... partly because I only ever had public repositories, and possibly I didn't want to share absolutely everything, especially not exemplary beginner's mistakes :) ).

Another one: being reminded of continuous integration. One day, I'm sure, it'll come in useful. The only problem is that it didn't fit into the workshop, but here are the related slides for that one day:

http://mangothecat.github.io/github-workshop/05-best-practices.html#continuous-integration

And, say, a third: catching sight of the "Git" tab in the RStudio GUI, once that's sufficiently enabled for the project. Again, one fine day ...

Monday, 7 March 2016

R: the window diff exercise

Implementing moving/sliding window transformations is one of the basic duties one may come across while developing various monitoring, time series analysis or signal processing tools. Third-party libraries probably already provide solutions, but it's a classic, so I'll devote an entry to it.

An example could be calculating a moving average over a time series. This pattern popped its head up again at an R coding dojo (in about 2013 ... that's how fast I write :) ). The concrete scenario has to do with the fingerprinting some antivirus software performs: a heuristic search for high-entropy chunks in a data stream (file), which, if found, suggest encryption was applied, possibly by a hiding virus (Derek's Shape Of Code entry has more about it).

Moving away from his original topic - where the brevity of R does a better job than lengthier code would - it appeared to me that this could be a nice example to illustrate a performance-optimization concept which can come in handy at other times, certainly in more performance-oriented scenarios.

Notations


n: number of samples (say ~ 1M)
m: window size (say ~ 1k)
k: a numerical constant (the point is that it is independent of the parameters above)

The original approach


cost ~ n * k * m

Thinking of the aforementioned R code: it walks through all possible positions of a time window of a certain size, and for each one computes the ratio of uniquely occurring symbols to the size of the window. The excerpt I'll focus on:
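(A sketch of that shape - my reconstruction rather than the original excerpt:)

unique_ratio_naive <- function(data, window.size) {
  positions <- 1:(length(data) - window.size + 1)
  sapply(positions, function(X)                                     # one call per window position ...
    length(unique(data[X:(X + window.size - 1)])) / window.size)    # ... each scanning a full slice
}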

This calculation can become massive, as its cost (in that form) is proportional to the window size multiplied by roughly the number of window positions - sapply() executes for each relevant X and calls unique() on a slice of the input data. The procedure examines most input symbols in the stream window.size times. Is that necessary? (Some relief is that the data is recalled from the CPU cache, but with a little thinking we can come up with something better.)

Although R may not be the most efficient environment for running such code (loops are preferably replaced with array operations, which may even allow implicit parallelization), I will show how it is possible to reduce the complexity of the algorithm from roughly n * m to roughly n + m (asymptotically) by applying a simple way of thinking twice (and I can tell in advance that count.positives() accounts for the multiplier of "m").

Remark: actually, the original considers only every 5th window position, but I'll take another route here, and so will ignore this.

First improvement


Assuming instead that we have knowledge of the first window, it is possible to take into account only the parts that leave the window and those that move into it - namely the symbol that gets excluded on the left and the one that gets included on the right. So, by focusing on the right items, the computational cost (though not yet in an asymptotic sense) can be reduced.

cost ~ n * k2 * m + m * k3, main point is k2 < k (expected)
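A sketch of what this looks like (again my reconstruction, not the original code):

unique_ratio_counts <- function(data, window.size) {
  data <- as.integer(factor(data))                  # map symbols to 1..K
  K <- max(data)
  n <- length(data)
  frequency <- integer(K)
  for (s in data[1:window.size]) frequency[s] <- frequency[s] + 1
  out <- numeric(n - window.size + 1)
  out[1] <- sum(frequency > 0) / window.size        # distinct-symbol count of the first window
  for (i in 2:(n - window.size + 1)) {
    left  <- data[i - 1]                            # symbol leaving on the left
    right <- data[i + window.size - 1]              # symbol entering on the right
    frequency[left]  <- frequency[left]  - 1
    frequency[right] <- frequency[right] + 1
    out[i] <- sum(frequency > 0) / window.size      # still scans all the counters at every step
  }
  out
}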


The frequency[] array holds per-symbol occurrence counters for the current window. Once initialized, it is maintained at a low(er) cost per window position, hence the memory reads become less redundant - although each sample is still accessed twice.

Second improvement


Now one might reconsider and notice that the quantity examined is basically the number of symbol counters that are non-zero. When a counter changes, it may either enter that set or leave it (i.e. change to zero). Therefore, once we have a starting point (the first window described), it is sufficient to focus our attention solely on the counters that change significantly - those that were positive and become zero (meaning the related symbol is no longer present in the new window), or were zero and become positive (meaning the related symbol appears in the new window).

So similarly:
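A sketch along the same lines (the only change from the previous one is that the number of non-zero counters is itself maintained incrementally):

unique_ratio_incremental <- function(data, window.size) {
  data <- as.integer(factor(data))
  K <- max(data)
  n <- length(data)
  frequency <- integer(K)
  positives <- 0                                    # number of distinct symbols in the current window
  for (s in data[1:window.size]) {
    if (frequency[s] == 0) positives <- positives + 1
    frequency[s] <- frequency[s] + 1
  }
  out <- numeric(n - window.size + 1)
  out[1] <- positives / window.size
  for (i in 2:(n - window.size + 1)) {
    left  <- data[i - 1]
    right <- data[i + window.size - 1]
    frequency[left] <- frequency[left] - 1
    if (frequency[left] == 0)  positives <- positives - 1   # a symbol just dropped out of the window
    if (frequency[right] == 0) positives <- positives + 1   # a new symbol just appeared in the window
    frequency[right] <- frequency[right] + 1
    out[i] <- positives / window.size
  }
  out
}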


cost = O(n + m): m for the initialization, n for all the steps together

This now has the promised characteristics.

As a note, when sliding a window over floating-point values, e.g. for a moving-average calculation, one should keep in mind that a value may have a different rounding impact when entering and when exiting the window. This may lead to accumulating numeric errors and, in the end, have a significant impact on the value computed for the last window positions. A similar problem emerges if the magnitude of the data varies a lot over sufficiently long intervals. But the values above are quantized - integers on every level of abstraction - so no rounding takes place, and this implementation caveat is naturally avoided.
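(A tiny illustration of the caveat with made-up numbers: maintaining a running sum incrementally, as above, versus recomputing it directly for the last window.)

set.seed(1)
x <- runif(1e6) * rep(c(1, 1e8), length.out = 1e6)   # values of wildly different magnitudes
w <- 1000
running <- sum(x[1:w])
for (i in 2:(length(x) - w + 1))
  running <- running - x[i - 1] + x[i + w - 1]       # the incremental update
running - sum(x[(length(x) - w + 1):length(x)])      # typically a small but non-zero drift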

At the finish line


The output of the code on my laptop is as below:

   Original  First imp. Second imp.
      5.846       3.443       2.051


Apparently, each iteration improved the performance a little - over 2.5x in the end - although this number is not that relevant. A true maximalist could go on to an Rcpp implementation, making it really fast.
Part of the reason is that such an approach would further benefit from the unfair advantage of compiled C++ parameter passing, which the first version cannot utilize (although that one does try to leverage the advantage of vectorized calculations to some extent).
Yet, the above is a demonstration of the concept rather than an über-optimization attempt, so no C++ comparison for now. The lucky thing (given the many constants affecting the end results) is that even though the two improvements work with for loops, they prove faster in at least one realistic run.


Closing words: this entry was partly written to exercise my writing skills, and also as a follow-up from years ago; however little significance it may have, I liked creating it. The code somehow reminds me of code dojo exercises. (Not sure if this pattern could make it as a Code Kata, something I heard about a good while ago? One more thing I'll have to check out anyway.)


To those who made it this far: thanks for reading!


The complete R code is available here.

Tuesday, 16 February 2016

R: why using require() over library() shouldn't be thy default


This is a very simple thing and really isn't worth wasting too many words on (but I will! :) ). The point I'm making is right there in the official documentation on these functions:

"require is designed for use inside other functions; it returns FALSE and gives a warning (rather than an error as library() does by default) if the package does not exist."

The trend, however, seems to be that increasing numbers of people prefer require(). In one-off recipes, which are a significant portion of what's being created with R, this doesn't make much sense.

For instance, at the beginning of one of XGBoost's demos, you'll find this:
require(xgboost)
require(Matrix)
require(data.table)
if (!require(vcd)) {
  install.packages('vcd') #Available in Cran. [...]
  require(vcd)
}

The first 3 require() calls do something - if a package is missing, they merely emit a warning and let the script run on. Then various other things happen, and finally the output, the main point of the demonstration, gets printed. The warning messages thrown at the top will easily go unnoticed among the output lines, unless someone's keen to scroll back up.

So let's get together some reasons against the overuse of require():

#1 Delayed feedback. Fatal problems are better surfaced early and in an obvious way when that comes at no extra cost. The warnings from require() may take extra effort to find.

#2 Risks. A system silently executing on with problems nobody expected is always a hazard. It gets even more interesting with function name collisions - when the intended package isn't attached, a standard function in the environment may get used instead of the one the author hoped for.

#3 Copycats (education). Furthermore, the reader/user of tutorial code does not necessarily have that much routine and is easily misled for a while - and while that period lasts, the bogus practice gets copied all over the web, in Kaggle scripts and blog entries (like this one ;) ).

As a side note, the lucky thing in the quoted code is that the packages are used early on, so any problems will likely surface early anyway - the data.table creation and similar steps make this happen. But that is far from making the code correct.
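If auto-installation is the goal, something along these lines keeps the loud, early failure (a sketch, not the demo's official code):

for (pkg in c("xgboost", "Matrix", "data.table", "vcd")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                # only attempt installation when it is actually missing
  }
  library(pkg, character.only = TRUE)    # still errors out loudly if the package cannot be attached
}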

Sunday, 7 February 2016

GPU, Lubuntu, ML - bloody roots

I presented myself with a CUDA-compatible video card for Christmas (an Asus GT730) so that I can check out Theano and GPU computing in general. Appetite notwithstanding, having a suitable amount of other business to deal with, I conveniently procrastinated on the installation. Could have been good intuition :)

I tried reinstalling it a few times, until I found that the initial RAM filesystem image for whatever reason needs to be regenerated:

sudo update-initramfs -u

Running this at the beginning and also at the end of the previously attempted installation steps, I believe it's "there" now, although a couple of things were still missing.

When running Octave, it was still complaining about not having a video driver in place:

Xlib:  extension "GLX" missing on display ":0".

A page said that adding a couple of lines to the code (it was the Coursera Machine Learning course's ex3.m) can remedy the situation ... and in fact I only needed to insert the line

graphics_toolkit gnuplot

right after the initial clean-up:

%% Initialization
clear ; close all; clc


and that solved this particular problem of mine - or at least I could see the initial plot.


Some of the CUDA examples are broken now, though. Not sure why (perhaps just a missing library path), but when running

MonteCarloMultiGPU

I'm getting this:

error while loading shared libraries: libcurand.so.7.5: cannot open shared object file: No such file or directory

... thank you so much, laters, then ...