By Leandro Ostera in knowledge engineering — Jun 12, 2020

AM002 – Engineering Knowledge: Identity

Let's analyze how we deal with information: its structure, its meaning, its identity, its provenance.

I was recently reading Software Engineering at Google by Winters, Manshreck & Wright, and I really liked the distinction they make between programming and software engineering.

In short, and without spoiling a pretty meaty book on the practices, policies, and tools that make software engineering possible at a gargantuan scale, software engineering is programming over time. If this is making your head work, go read the book.

What struck me was that I had been working with knowledge systems for a long time, yet I hadn’t found much out there describing the difference between a system that was merely programmed and one that was engineered for changing requirements: one that can be operated, repaired, and adapted with minimal effort and maximal impact.

In most places I’ve worked, I’ve found systems that are just thrown together to get some data in over HTTP, marshal it into some data structures, and send it off into a database. The database is likely structured in a strict-ish manner, with some constraints set by the schema, such as the uniqueness of some keys. The code that moves the data in and out may do some validation as well. You get the picture: the system has information about the information it processes throughout all its layers.

Now, when the requirements change, we have to go through several layers to make them work. Validation rules may change when inserting data, so the HTTP layer must reflect this. But because the rule lives only in that layer, the data that reaches the database carries no trace of it. In fact, the database may need to stay flexible to support the old data that doesn’t pass this validation. Fair enough.

Other requirement changes, like a field that wasn’t supposed to be unique now being expected to be unique, can involve reprocessing the entirety of the data, pretty much recreating the database. If you used unique identifiers or monotonically increasing ones, you may end up redefining the identity of the data by changing only this requirement. If any systems that depend on this data have held on to the identity of these data, they’ll have to be amended as well. If these belong to separate teams or departments, a small constraint change has become an organizational problem.
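
To make this concrete, here is a minimal sketch of how such a rebuild can shift identities. The `load` function, the sample rows, and the `email` field are all hypothetical; the sketch only assumes that ids are handed out in insertion order, the way an auto-incrementing key typically is.

```python
from itertools import count


def load(rows, unique_on=None):
    """Insert rows in order, assigning incremental ids as a sequence would.

    If unique_on is given, later rows that repeat that field are dropped,
    mimicking a rebuild under a new uniqueness constraint.
    """
    next_id = count(1)
    seen = set()
    table = {}
    for row in rows:
        if unique_on is not None:
            key = row[unique_on]
            if key in seen:
                continue  # duplicate under the new constraint
            seen.add(key)
        table[next(next_id)] = row
    return table


rows = [
    {"email": "a@example.com", "name": "Ana"},
    {"email": "a@example.com", "name": "Ana B."},
    {"email": "c@example.com", "name": "Carl"},
]

before = load(rows)                    # no uniqueness requirement
after = load(rows, unique_on="email")  # rebuilt with the new requirement

print(before[2]["name"], "->", after[2]["name"])
```

Nothing about Carl changed, yet his id did. Any system that stored id 2 while it pointed at one record now silently gets a different one.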

So let's analyze how the same core principle of “programming over time” would apply to information in a system:

The structure of the information will change, be it in storage, transport, or other canonical representations.
The meaning of the information will change. In fact, we only know what the data means now, to the current users and features; new features could be built that rely on the same data but treat it so differently that it essentially means something else.
The identity of a piece of information, such as a description of a home in a real estate market, may be transient and internal, or global and immutable.
The provenance of the information, that is, where it came from originally and how we can backtrack to understand its origins and evolution.

I’ll try to cover these in the following weeks.

Today let’s talk about Identity.

From an early age, we learn to distinguish objects from one another. In this process, we assign objects an identity: an abstract, conceptual value that is associated with only one of them and can be used to tell them apart. Sometimes this value is made concrete in some way: maybe you gave them pronounceable names, maybe you just kept them very separate from each other, or maybe you gave them monotonically increasing identifiers, as some kids do.

Now, what happened was that you and only you knew which object was which. Unless you shared and made very public that Rex was one, and then there’s “the other”, it would be pretty hard for anyone to figure out what you meant. Your identifiers were local.

In the same way, if you changed your mind and gave some objects new identifiers, the old ones went out the window and people would be confused as to what you were referring to. They’d have to re-learn the new object names. Your identifiers were transient.

Local, transient identifiers can be incredibly useful, but they are normally better kept for data that is not shared at all. E.g., I’ve got 3 guitars at home and I have names for all of them. Some of the names have changed, and no one has been affected. So when a piece of information needs to be told apart from a group, it's good to ask yourself: who needs to know about this?

If the answer is more than just “this single system” or “this one component”, then consider that local, transient identifiers may be useful in the short term but will not scale well over time.

At the other end, we have the ideas of global (as opposed to local) and immutable (as opposed to transient) identifiers. These are values that don’t really belong to anyone, and they can’t change, so we are free to spread them around, even publicly.

So all in all, your choice is to favor either locally transient or globally immutable identification. This is a false dichotomy, of course, since we can come up with a few hybrids, some better than others (please stay away from globally transient identification, it will only bring you misery).

Either information inherently belongs to a source of truth (some specific system’s database), or it inherently belongs to nobody, and thus we can build a single source of truth for it for practical matters.

I’m inclined to think that information either belongs to the organization building the systems, but no system in particular, or that it is external information that is actually owned by a 3rd party, even if the organization building the systems owns the representation of it.

One relatively cheap way of building strong identity for data across our systems is to use Uniform Resource Identifiers [see RFC3986] instead of incremental ids or randomized identifiers such as a plain UUID. A URI can, in turn, be used as a locator, a name, or both.

Consider a system that stores essays (maybe it is storing the one you are currently reading) and the ways it could assign an id to it:

Essay with id 002, since it's the second I’ve published
Essay with id 1993388828, because that’s how many posts have been made globally across all users
Essay with URI “am:essays:002”, where the namespace “am” stands for Abstract Machines, and the classifier “essays” tells us this URI points to data of a certain kind. Lastly “002” is a specific identifier that singles out this essay from all the other essays under this namespace.

It is easy to see how a URI here is vastly more descriptive and carries around enough contextual information to speak of not only what it is, but even who owns it. Many large companies that work with data have opted for using URIs to globally reference their information. Google, Spotify, and LinkedIn are some examples of this.
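
As a rough sketch of what working with such identifiers could look like, the snippet below models the namespace:kind:id shape from the example above. The `ResourceUri` class, its field names, and the allowed characters for the kind and id parts are my own assumptions for illustration; only the rule for the leading namespace (a URI scheme) comes from RFC 3986.

```python
import re
from dataclasses import dataclass

# RFC 3986 scheme syntax: a letter, then letters, digits, "+", "-" or ".".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")
# Assumption: kind and local id use unreserved characters only, never ":".
_SEGMENT = re.compile(r"[A-Za-z0-9._~-]+")


@dataclass(frozen=True)  # frozen: once created, an identifier cannot change
class ResourceUri:
    namespace: str  # who owns the data, e.g. "am"
    kind: str       # what kind of data it is, e.g. "essays"
    local_id: str   # which one it is within that kind, e.g. "002"

    def __post_init__(self) -> None:
        if not _SCHEME.fullmatch(self.namespace):
            raise ValueError(f"invalid namespace: {self.namespace!r}")
        for segment in (self.kind, self.local_id):
            if not _SEGMENT.fullmatch(segment):
                raise ValueError(f"invalid segment: {segment!r}")

    @classmethod
    def parse(cls, text: str) -> "ResourceUri":
        parts = text.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected namespace:kind:id, got {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.namespace}:{self.kind}:{self.local_id}"


essay = ResourceUri.parse("am:essays:002")
print(essay.namespace, essay.kind, essay.local_id)
```

Because the value is frozen and compares by content, two systems that parse the same string independently end up holding equal identifiers, without either of them having to ask a database what the id is.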

At the end of the day, a system that works with data using global, immutable identities has detached ownership of the data from the system and lifted it to the abstract machinery composed of the current and future implementation of the very same systems that process this information.

That’s one way in which we can engineer our way to a more resilient data and information system. So next time you’re about to let your database decide what the identity of something is, consider what the lifecycle of this identity will be and whether using a URI instead may be more fruitful.

References

Software Engineering at Google [book]
RFC3986: Uniform Resource Identifier [rfc]
RFC7320: URI Design and Ownership [rfc]
Web Ontology Language [spec]
An introduction to Description Logic [book]
A History of Clojure [pdf]
