Monday, 6 April 2020

os.system vs. line continuation on Windows :\

(note to self)

So at least in the Windows style (and of course actually on Windows), Python's os.system does not much enjoy the ^ symbols that are accepted there to mark lines of a batch file that continue on the next line. This holds even when they are used correctly, and even though they otherwise work fine when pasted into a command line shell.

Could it be that the CR+LFs halted the parsing at the end of the first line? Maybe someone will find out and publish.
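
One way around it, sketched below under the assumption that the ^ continuations really are the culprit: collapse them into a single line before handing the command over to os.system. The helper name join_caret_continuations is made up for illustration, and the regex only covers the plain case of a ^ directly followed by a line break (an escaped ^^ at the end of a line would get mangled).

```python
import re


def join_caret_continuations(command: str) -> str:
    """Collapse batch-style '^' line continuations into one line.

    Hypothetical helper: removes every '^' that is immediately followed
    by a line break (LF or CR+LF), together with that line break.
    """
    return re.sub(r"\^\r?\n", "", command)


multi_line = "echo one ^\r\n  two ^\r\n  three"
single_line = join_caret_continuations(multi_line)
# single_line no longer contains '^' or line breaks and could now be
# passed on as os.system(single_line).
```

I have not checked this against every quoting corner case of cmd.exe, so treat it as a starting point only.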

For me, just one more lesson learnt the hard way ... :)

Monday, 2 March 2020

Structured and unstructured ... wait, what?

To date I find it difficult to align myself with the distinction between structured and unstructured data.

It feels like one of those cases when someone needed to categorize things for a practical task given their preferences, and without realizing their subjectivity they tried to set those subjective boundaries in stone.

Now the industry (it wouldn't be the first time) feels comfortable with the game of Simon says, and tries to explain the stuff that lacked pragmatic reasoning, i.e. what structured and unstructured data are, via examples instead of in ways that would lead to measures and formulae, which could ultimately make the distinction algorithmically determinable.

One of the explanations is along the lines of 'data that can be described by relational databases'.

Have I got news for you: everything can at least be modelled via relational databases.

Now the thing with models is that they're (almost) never expected to be 100% accurate.

Well, I think what the authors there are trying to say is that relational databases can't always give a 100% meaningful representation of the data, one that would describe the relationships faithfully and storage-efficiently, in an easy-to-search way.
But then we could ask: okay, how do you measure how far we are from that 100% faithfulness, and will that measure be good enough to make comparisons?
In other words, we have started to come back to how well the data can be processed.

And some authors seem to start right there. Well, well, that's exactly what has always been varying, as far as we can judge from the known part of human history. Indeed, "what is easy to process" does change rapidly due to various relatively revolutionary innovations (from Gutenberg's stuff through GPUs to, soon enough, quantum computing networks).
So categorizations based on this criterion will change, drastically.

Maybe we just don't admit that unstructured data is simply data whose structure has not been unveiled yet, data that we're not quite ready to process, but that otherwise it's almost the same thing?
Thereby "unstructuredness" is really a momentary property of the relationship between the observer and the data?


Not sure if a good example, but how would you classify a photo of an Excel spreadsheet? Then a screenshot? How is it worse or better than the data in the spreadsheet itself? And then what if someone pastes the Bible onto that very sheet? But then what if the sheet was like really big, and the Bible is just a tiny unnoticeable outlier within it :D

Wednesday, 22 January 2020

Win10 =? -2 cores

Just a few tens of minutes ago I installed Windows 10 Enterprise on a VM.
Lo and behold, it's still eating roughly 2 (slightly deviating from the screenshot value, 60-70% on average) of 3 cores.
Er.... what on Earth for?
(The origin ISO was a quite recent download.)
I believe I do have hardware virtualization on my dev laptop, I tried waiting, I am too lazy to mention what I found on Google.
I never thought it would come to that, but praise the Lord for Ubuntu ;)

Friday, 3 January 2020

Stay nosy - notes about a couple of functional gadgets in Python

Mission briefing:

#1 lru_cache() - does not always cache when you'd expect


(given an f(a): f(5) != f(a=5), or at least these count as two different calculations as far as the cache is concerned, and could even yield different values)

#2 partial() - you can still set the 'frozen' parameters


To begin with:

#1 lru_cache()


It's a great thing to have something like this in the core set of a programming language.

However, beware: it will not consider an argument passed in as a keyword argument equivalent to the same argument passed in at its respective position.

A StackOverflow question is already dedicated to the topic - it was a bit of a surprise to me that this is so.

The code below demonstrates what, somewhat to my own surprise, I came to expect once I had taken a look at the lru_cache() implementation (for reasons irrelevant to the argument):
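
A minimal sketch of that behaviour, assuming CPython's functools.lru_cache (as far as I can tell, treating the two call forms as separate entries is how CPython does it rather than a hard guarantee). The function f and the calls list are made up for illustration; f is deliberately impure so that the two cache entries end up holding different values.

```python
from functools import lru_cache

calls = []  # records every real (non-cached) invocation


@lru_cache(maxsize=None)
def f(a):
    # Deliberately impure: the result depends on how often f really ran.
    calls.append(a)
    return len(calls)


first = f(5)      # miss: computed and stored under the positional key
second = f(a=5)   # miss again: the keyword form gets its own key
third = f(5)      # hit: served from the positional entry

info = f.cache_info()  # hit/miss counters kept by lru_cache
```

So the same argument value produced two cache entries, and with an impure function the two entries do not even agree with each other.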

Next up:

#2 partial()


Now I must admit I am not a vigorous user of partial(), but I would have expected it to resist attempts to override the already 'frozen' (word from the Python documentation) parameters.  

"The partial() is used for partial function application which “freezes” some portion of a function’s arguments and/or keywords resulting in a new object with a simplified signature."

At least one risk here is that when part of a larger system, some code might change those values behind your back, breaking your assumptions.

I will try to remind myself to look at functions obtained via partial() as if they were simply provided a default value for some of their arguments.
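
A small sketch of what I mean (power, square and power_default are names made up for the example). It covers the keyword case only: keyword arguments given at call time override the frozen keywords, whereas frozen positional arguments just get further arguments appended after them.

```python
from functools import partial


def power(base, exponent):
    return base ** exponent


square = partial(power, exponent=2)   # exponent is 'frozen' to 2 ...

as_frozen = square(3)                 # uses exponent=2
overridden = square(3, exponent=3)    # ... yet the caller can still replace it


def power_default(base, exponent=2):
    # Plain default argument: looks the same from the caller's side.
    return base ** exponent
```

From the caller's point of view square and power_default are interchangeable here, which is why the 'default value' mental model seems the safer one to me.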



Saturday, 28 December 2019

Saturday night fun

Lon dev's life == lots of joy!



    # software development is very recreational
    self.recreate_client()


Sadly this bit might be being deleted.


Monday, 28 October 2019

Pandas +1 thing to be nagged by

ipdb> df = pd.DataFrame(dict(x=[1, 2, 2, 3, 4, 5])); df.set_index(["x"], inplace=True); df.index.get_loc(3, method="pad")                    
*** pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects


Without method="pad" things seem to work. Not sure I should flag this up, though the error message is at least a bit misleading. Also not sure whether others get newer Pandas versions on some channel: mine is 0.24.2, while there are docs for 0.25.2 out there. Bug? Feature? I guess it's best termed 'room for a feature'.

Yaaaay it's Mondaaaay :)

Then I'm not sure if I'm completely right to just get around the problem using np.searchsorted() with roughly the desired effect, but in the given case "behaviour exactly as thought out" is not that indispensable. So I think I just will. Shame on me.
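
The idea of the workaround, sketched with the standard library's bisect as a stand-in so it runs without any third-party packages; on a sorted array, np.searchsorted(values, key, side="right") - 1 should land on the same position. The helper name pad_position is made up, and with duplicates it picks the last of the equal values, which is an assumption about what "pad" ought to mean here.

```python
from bisect import bisect_right


def pad_position(sorted_values, key):
    """Position of the last element <= key in an ascending sequence."""
    position = bisect_right(sorted_values, key) - 1
    if position < 0:
        raise KeyError(key)  # nothing at or before key to pad from
    return position


index_values = [1, 2, 2, 3, 4, 5]          # same values as in the session above
exact = pad_position(index_values, 3)      # exact match on the value 3
between = pad_position([1, 2, 2, 4], 3)    # no 3 present: falls back to the last 2
```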

Thursday, 24 October 2019

Sure an anti-pattern? "Asking for forgiveness instead of permission"

Among Python's idiomatic approaches there is "It's easier to ask forgiveness than it is to get permission".

It may be a brilliant thought in context, but it could easily be mistaken for a silver bullet when torn out of its habitat and forced into an over-generalized application. See, not everything is a "nail" for your hammer.

To show what I mean, let's say I copycat a prime example mentioned here.
(Google cache link)

First of all, as of now the anti-pattern example is
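
Roughly along these lines (a paraphrased sketch rather than a verbatim quote; the function name remove_lbyl and the path argument are placeholders of mine):

```python
import os


def remove_lbyl(path):
    # "Asking for permission": check first, then act.
    if os.path.exists(path):
        os.unlink(path)
```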


Versus the recommended:
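
Again a paraphrased sketch: I'm assuming a broad except OSError here, and the exact exception class in the linked example may differ. The function name remove_eafp is a placeholder of mine.

```python
import os


def remove_eafp(path):
    # "Asking for forgiveness": just try, and swallow the failure.
    try:
        os.unlink(path)
    except OSError:
        # Meant for "file does not exist", but OSError covers far more.
        pass
```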


Now exception handling isn't difficult to get right, but that doesn't always happen anyway for whatever reason.
The recommendation is slightly (?) incorrect. The two implementations aren't equivalent at all.

The "pop-pattern" (popularized pattern) catches, and thereby silently allows, a whole range of exceptions. It probably makes no difference between, say, hard drive failures, read-only files and non-existent files.

Of course if the file was read-only (from the perspective of the code at least, e.g. due to insufficient rights), whilst the rest of the implementation assumes it is deleted, then further unexpected situations are likely to arise, reducing the robustness of the software.

So there's the situation: we're misinterpreting a new exceptional case, and in the end it may just never surface in production, because the exception is not communicated. And what's worse than a software defect that is not noticed ...
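
To make the silent failure tangible, here is a small sketch. The helper remove_quietly is hypothetical and simply mirrors a broad except OSError handler; a directory stands in for "something unlink cannot delete", since that fails with some OSError subclass on the platforms I know of.

```python
import os
import tempfile


def remove_quietly(path):
    # Broad handler in the style discussed above (hypothetical helper).
    try:
        os.unlink(path)
    except OSError:
        pass


with tempfile.TemporaryDirectory() as parent:
    target = os.path.join(parent, "not_a_file")
    os.mkdir(target)            # a directory, so unlink() must fail
    remove_quietly(target)      # ... but nothing is reported
    still_there = os.path.exists(target)
```

The caller carries on believing the path is gone, while still_there says otherwise.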

Or let's say you catch that very specific exception that is triggered when the file to be deleted does not exist. How can you be 100% certain it was the non-existence of that very file that triggered the exception? Yes, it's plausible to assume that the unlink (remove) code will not involve other files, but generally ... no, the code blocks still would not be equivalent.

This isn't my point, but I believe it points in the right direction.

The "pop-pattern" may make maintenance difficult

First of all, in real life these exceptions are often only noticed in practice. "Ooops", says the programmer, and adds a try..except block. Or better, the block is added in anticipation, but that sometimes requires an extensive analysis of the documentation, which sometimes just doesn't happen. Yes, people aren't perfect (and hopefully don't even try to claim to be).

So then the code may change slightly. Say from os.unlink to "remove_file(...)" in your own implementation, dealing with local/remote storage, whatever. You'll surely have to watch out: as the origin of the exception gets farther and farther away from the handling code, and as complexity sneakily builds up in the code covered by the handler, your exception handler will be increasingly likely to get confused.

Had you used a particular error code, that would expressly oblige the code maintainers to provision for you as told. Well, hopefully. In short, my impression is that exception handling can be more costly in the long run than watching out for the individual situations, which are not necessarily clearly communicated in the form of exceptions, and even if they are, the respective checks are counter-intuitive to make. I wonder how often people are seen actually testing for that filename member of the FileNotFoundError exception :)
(Ray of hope: there was at least one guy :) but a GitHub search suggests that it isn't the norm - and frankly, who on Earth has time for this?! outside this too long train of thoughts)
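
For the record, such a check could look roughly like this. It is only a sketch: remove_if_present is a made-up name, and comparing via os.fspath() assumes that the exception reports the path in the same form it was passed in.

```python
import os


def remove_if_present(path):
    """Remove path; return False only if that very path was missing."""
    try:
        os.remove(path)
    except FileNotFoundError as exc:
        # Re-raise unless the error is about the path we asked for.
        if exc.filename is None or os.fspath(exc.filename) != os.fspath(path):
            raise
        return False
    return True
```

Anything other than "this exact file was not there" (permissions, directories, I/O trouble) still propagates to the caller.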

The fun bit

Quoting from the Python 3.8 documentation:
(Google cache link)



os.unlink(path, *, dir_fd=None)
Remove (delete) the file path. This function is semantically identical to remove(); the unlink name is its traditional Unix name. Please see the documentation for remove() for further information.
New in version 3.3: The dir_fd parameter.
Changed in version 3.6: Accepts a path-like object.

os.remove(path, *, dir_fd=None)
Remove (delete) the file path. If path is a directory, an IsADirectoryError is raised. Use rmdir() to remove directories.
This function can support paths relative to directory descriptors.
On Windows, attempting to remove a file that is in use causes an exception to be raised; on Unix, the directory entry is removed but the storage allocated to the file is not made available until the original file is no longer in use.
This function is semantically identical to unlink().
New in version 3.3: The dir_fd argument.
Changed in version 3.6: Accepts a path-like object.

Briefly: the exception caught in the example code block does not feature in the official docs, at least not for the layman. In other words, this has the potential to be an example of an "oops, said the programmer" situation ... although in the given case it's probably the right exception to catch.

But (don't ever) trust me (but still give it a thought :) ): you wouldn't like to always operate in this trial-and-error fashion when provisioning code that is likely there for the long run, and possibly for a sizable user base.

Conclusion

I'm not sure there's one :) But I'll try to come up with something.
In my opinion this particular one is a dodgy anti-pattern: a good thought, but very easy to misinterpret.
Popularly mentioned anti-patterns should still be avoided with care. In my experience, exception handling should be as specific as possible (and I'm not sure if exception handling can be specific unless the framework the code relies on is restricted to be a specific version and/or specification), or bad things can happen.
Don't overcatch, don't overapply :)