Friday, 14 June 2019

5 things I hate about Pandas #1 (..#3)

Not sure this will end up a five-element list, but frankly, coming from an OOP background, I find some aspects of Pandas' multi-index support quite ridiculous.

For instance this one.

Obviously, there may be consistency constraints stemming from other areas (which?), but I would primarily expect a multi-index to let me pick a subset of a data frame by reference.
That would be useful, at the very least for the sake of elegance ('data hiding'): I could hand that subset to processing code that has no reason to look anywhere else in the frame.
It would also keep me from getting confused when debugging, etc.

Such as:

process_stuff1(df.stuff1_columns)

or at least, if reading the reference yields a copy:

df.stuff1_columns = process_stuff1(df.stuff1_columns)

(I prefer the . notation over the indexer [""] for code completion purposes.)

And, well, no. Neither works. Just memorize it.
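For comparison, here is a minimal sketch of what does seem to work for me: select the group through its top-level column label, process the result, and write it back explicitly. The frame, the labels stuff1/stuff2 and the body of process_stuff1 are made up for illustration; this is not a claim about the one recommended way.

```python
import pandas as pd

# Hypothetical frame with two-level column labels: (group, field).
columns = pd.MultiIndex.from_tuples(
    [("stuff1", "a"), ("stuff1", "b"), ("stuff2", "c")]
)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns)


def process_stuff1(block):
    """Hypothetical processing step: doubles every value it is given."""
    return block * 2


# Selecting the top-level label yields a frame holding only that group,
# with the top level dropped from its column labels.
block = df["stuff1"]
result = process_stuff1(block)

# Write the processed columns back one by one, using full (group, field) keys,
# instead of relying on the selection behaving like a reference.
for field in result.columns:
    df[("stuff1", field)] = result[field]
```

The processing function only ever sees the stuff1 columns, which is the 'data hiding' part; the price is the explicit write-back loop.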

I do wonder, though, how many people find it intuitive that these are not "the approach"?
(Those who end up at the SO entry probably don't.)

Well, I can live with it. (That's the can-do = compromising attitude :) )

UPDATE: #2

Drop rows based on condition #20944


It's mid-2019 now. No comment.
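For the record, a minimal sketch of the two workarounds I know of for dropping rows by a condition, assuming that is all the issue title is about. The frame and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: drop every row whose value is negative.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, -2, 3]})

condition = df["value"] < 0

# Variant 1: keep the rows where the condition does NOT hold (boolean mask).
kept = df[~condition]

# Variant 2: look up the index labels of the matching rows and drop those.
dropped = df.drop(df[condition].index)
```

Both variants return a new frame and leave df untouched.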

UPDATE: #3

When aggregating a Pandas data frame, you can pass a dictionary that assigns an aggregation function to each column. Say you aggregate one column of numbers by adding them up, while for another column you want the maximum value.
The results may end up with odd column names: the totals column, for example, keeps the name of the values being tallied up, which can be obscure.
So, intuitively, you'd change the names (df.columns = [...]).
Now, if you take the column order for granted, you're wrong.
Python dictionaries were not guaranteed to keep insertion order before Python 3.7, so the order in which you defined their elements may get lost along the way.
So, generally, just don't do that, or specify an order, or whatever ... the intuitive road is blocked anyway :\
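One way around the ordering question, as a minimal sketch: rename the aggregated columns by label rather than by position, so it no longer matters in which order they come out. The frame and the names key, amount, score, total_amount and max_score are made up for illustration.

```python
import pandas as pd

# Hypothetical data: sum one column and take the maximum of another, per key.
df = pd.DataFrame(
    {"key": ["x", "x", "y"], "amount": [1, 2, 3], "score": [10, 30, 20]}
)

# The dictionary assigns one aggregation function to each column.
aggregated = df.groupby("key").agg({"amount": "sum", "score": "max"})

# Rename by label instead of assigning df.columns = [...] by position,
# so the result does not depend on the order of the output columns.
aggregated = aggregated.rename(
    columns={"amount": "total_amount", "score": "max_score"}
)
```

The mapping passed to rename ties each new name to the old one, which is the part the positional df.columns = [...] assignment leaves to luck.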


