Monday, 2 March 2020

Structured and unstructured ... wait, what?

To this day, I find it difficult to accept the distinction between structured and unstructured data.

It feels like one of those cases where someone needed to categorize things for a practical task, did so according to their own preferences, and then, without realizing how subjective those boundaries were, tried to set them in stone.

Now the industry, and it wouldn't be the first time, seems comfortable playing Simon Says: it tries to explain a distinction that lacked pragmatic reasoning in the first place, i.e. what structured and unstructured data are, by way of examples rather than in ways that would lead to measures and formulae, and ultimately to something algorithmically determinable.

One of the explanations is along the lines of 'data that can be described by relational databases'.

Have I got news for you: everything can at least be modelled via relational databases.
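To make that claim a bit more tangible, here is a minimal sketch using Python's built-in sqlite3 module. The table name `tokens`, its columns, and the helper functions `store_text` and `load_text` are made up for illustration; the point is only that free text fits a relational schema without loss, not that this is a sensible way to store it.

```python
import sqlite3


def store_text(conn, doc_id, text):
    """Store free text as one row per space-separated token (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "doc_id INTEGER, position INTEGER, token TEXT, "
        "PRIMARY KEY (doc_id, position))"
    )
    # Splitting on a single space keeps the round trip exact.
    rows = [(doc_id, i, tok) for i, tok in enumerate(text.split(" "))]
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)


def load_text(conn, doc_id):
    """Rebuild the original text from its rows, in order."""
    cur = conn.execute(
        "SELECT token FROM tokens WHERE doc_id = ? ORDER BY position",
        (doc_id,),
    )
    return " ".join(row[0] for row in cur)


conn = sqlite3.connect(":memory:")
original = "In the beginning God created the heaven and the earth."
store_text(conn, 1, original)
restored = load_text(conn, 1)
```

The text comes back unchanged, so in that narrow sense the relational model "describes" it. What the schema does not capture is anything about what the words mean or how they relate to each other.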

Now, the thing with models is that they're (almost) never expected to be 100% faithful.

Well, I think what those authors really mean is that a relational database can't give a 100% meaningful representation of such data: one that describes the relationships faithfully and storage-efficiently, in an easy-to-search way.
But then we could ask: okay, how do you measure how far we are from that 100% faithfulness, and will that measure be good enough to make comparisons?
In other words, we are back to the question of how well the data can be processed.

And some authors seem to start right there. Well, well: that is exactly what has always been changing, as far as we can judge from the known part of human history. Indeed, "what is easy to process" changes rapidly with each relatively revolutionary innovation (from Gutenberg's printing press through GPUs to, soon enough, quantum computing networks).
So categorizations based on this criterion will change, drastically.

Maybe we just don't admit that unstructured data is simply data whose structure has not been unveiled yet, data that we are not quite ready to process, but that is otherwise almost the same thing?
Would "unstructuredness" then really be a momentary property of the relationship between the observer and the data?


Not sure if this is a good example, but how would you classify a photo of an Excel spreadsheet? And a screenshot? How is it worse or better than the data in the spreadsheet itself? And what if someone pastes the Bible into that very sheet? And what if the sheet were really big, and the Bible just a tiny, unnoticeable outlier within it? :D
