Metadata: curse or cure for GDPR compliance?

Deleting every single digital reference to anyone is a logistical nightmare. The solution? Metadata.

noun, plural in form but singular or plural in construction | meta·da·ta | \ -'dā-tə, -'da- also -'dä- \

Metadata allows organisations to map and locate everywhere that the various bits of data they hold about an individual reside, to identify the privacy-sensitive data items, and to determine if, when and how a hack occurred.

The European Commission defines personal data as “anything from a name, a home address, a photo, an email address, bank details, posts on social networking websites, medical information, or a computer’s IP address”. Any kind of sentiment mining, opinion mining or tracking of customer behaviours, such as click-through rates, browsing history, likes, shares, comments, bookmarks or endorsements, triggers GDPR concerns and regulations if these activities are or can be used to identify a private person.

While GDPR uses the presence of large fines as a motivating factor towards universal compliance, no regulation can capture more than a small fraction of the ways in which a natural person can be identified, regardless of their level of explicit anonymity. As human beings are fundamentally social animals, anyone on the grid is transparent, whether directly or indirectly, by way of their own activity, or the activity of those to whom they are connected. Metadata makes it all possible.

The curse

However benevolent the aims and vision of GDPR may be in creating a common-information market for the EU, it is - by the very nature of connected data today - limited in its ability to effectively defend EU citizens against data mining, which may very easily lead to de-anonymisation.

GDPR promises to stop companies from hitting the nail on the head and abusing the rights of individuals to control their data; it does not, however, stop an enterprising data scientist or rogue nation from hitting everywhere else, whereby any nail is quickly exposed by its absence. To provide a contemporary example, a 2017 study by researchers at Stanford and Princeton universities, De-anonymizing Web Browsing Data with Social Networks, found that “browsing histories contain tell-tale marks of identity” that alone can be used to determine a user’s Twitter account, even if that user has never tweeted.

The method identified over 70 per cent of users in a set of 400 volunteers using only 30 links originating from Twitter, and this rose to 86 per cent when 50-75 links from their history were provided. This means that, without any explicitly identifying information whatsoever, individual identity can be inferred from web browsing history alone.
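To make the idea concrete, here is a deliberately simplified sketch of how a browsing history can point to one account. It is not the study's actual model, which this article does not describe; it simply scores each candidate account by how many links in an anonymous history also appeared in that account's feed. The account names, URLs and the best_match function are all hypothetical.

```python
def best_match(history, feeds):
    """Return the candidate whose feed shares the most links with the history.

    Plain overlap counting, used here only as an illustration; this is an
    assumption, not the method used in the study cited above.
    """
    scores = {user: len(history & links) for user, links in feeds.items()}
    return max(scores, key=scores.get), scores


# Hypothetical feeds: the links each candidate account would have been shown.
feeds = {
    "account_a": {"news.example/1", "blog.example/7", "shop.example/3"},
    "account_b": {"news.example/1", "video.example/9", "forum.example/2"},
    "account_c": {"blog.example/7", "video.example/9", "wiki.example/5"},
}

# An anonymous browsing history: no name, email address or login attached.
history = {"news.example/1", "video.example/9", "forum.example/2", "maps.example/4"}

match, scores = best_match(history, feeds)
print(match, scores)
```

Even in this toy case, no single link identifies anyone; it is the combination of links that singles out one account.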

Another feature of modern life, however, is the ubiquity of GPS sensors in phones. Work by the MIT Media Lab under Alexander Pentland (see his excellent book Social Physics for further reading) has explored the numerous ways in which basic GPS data and credit card histories can de-anonymise even the most astute defender of personal privacy. In this spirit, a 2015 study entitled “Spatio-temporal techniques for user identification by means of GPS mobility data” found that GPS data was a powerful means of identifying individuals even in the absence of personal information, based upon the fact that humans are fundamentally habitual. Indeed, the researchers found that “as little as two spatial points are sufficient to uniquely identify nearly 100 per cent of the users”.
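The intuition behind the “two spatial points” finding can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not the technique used in the 2015 study: it treats each person's two most-visited coordinates (think home and work) as a signature and measures what fraction of people have a signature nobody else shares. The traces, user labels and function names are hypothetical.

```python
from collections import Counter


def top_places(points, k=2):
    """The k most frequently visited coordinates, as an unordered signature."""
    counts = Counter(points)
    return frozenset(place for place, _ in counts.most_common(k))


def unique_fraction(traces, k=2):
    """Fraction of users whose top-k signature is shared with no other user."""
    signatures = {user: top_places(pts, k) for user, pts in traces.items()}
    frequency = Counter(signatures.values())
    unique = [u for u, sig in signatures.items() if frequency[sig] == 1]
    return len(unique) / len(traces)


# Hypothetical, name-free GPS traces (coordinates rounded to two decimals).
traces = {
    "u1": [(52.52, 13.40)] * 3 + [(52.50, 13.45)] * 2 + [(52.49, 13.39)],
    "u2": [(52.52, 13.40)] * 3 + [(52.53, 13.38)] * 2,
    "u3": [(52.50, 13.45)] * 3 + [(52.53, 13.38)] * 2 + [(52.52, 13.40)],
    "u4": [(52.50, 13.45)] * 2 + [(52.52, 13.40)] * 2,
}

print(unique_fraction(traces))
```

In this tiny sample, u1 and u4 share the same pair of places, so only half of the users are singled out. The study's point is that in real mobility data such collisions are rare.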

Beyond these more indirect routes to de-anonymisation, social media-based news feeds during the 2016 US elections and the UK’s Brexit vote have clearly shown that humans like humans who are similar to themselves, a principle known as homophily.

This desire to both believe in and engage with those similar to ourselves is the very principle that has led to a system that reinforces existing beliefs at all costs rather than challenging them and promoting critical thinking, whilst also making us all the more transparent.

Homophily aside, internet-based businesses themselves have been gathering data on individuals and tagging them in order to increase their advertising revenue, as a new study entitled Facebook Use of Sensitive Data for Advertising in Europe has revealed.

The study found that 73 per cent of European Facebook users have been tagged with sensitive interests by the social networking behemoth, which translates into 40 per cent of the EU population, or around 200 million citizens.

Considering why we are attracted to social networks, and our inherent biases with regard to people and opinions that match our own, the challenge we face in being truly private persons seems to stem from the very thing that has made us successful over more physically imposing species: our social connectivity.

What are metadata?

While metadata have often been described as “data about data”, it is arguably more accurate to think of a piece of metadata as “a statement about a potentially informative object” (Jeffrey Pomerantz, Metadata, 2015). In this definition, metadata provide context around a data object that may, in the right circumstances, be informative and thereby valuable to an observer.

Metadata is often thought of in three main categories:

I. Descriptive. The simplest, oldest and most common type, descriptive metadata includes, among others, the 15 properties defined by the Dublin Core Metadata Initiative to describe resources:

1. Contributor 2. Coverage 3. Creator 4. Date 5. Description 6. Format 7. Identifier 8. Language 9. Publisher 10. Relation 11. Rights 12. Source 13. Subject 14. Title 15. Type

For digital data such as blog posts, items such as tags, categories, usernames and IDs, number of comments and ratings constitute additional metadata.

II. Structural. Describes how objects are compiled, organised or designed, such as tables of contents, page and chapter numbering and indices, as well as relational data such as “image X was included in document Y”.

III. Administrative. This includes file types, access permissions, rights metadata, preservation data required for archiving, image resolution, file formats, compression type used, license information, copyright dates and so forth.
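As a rough illustration of how the three categories might sit side by side for a single blog post, consider the sketch below. The MetadataRecord class, its field names and the sample values are hypothetical and chosen purely for illustration; only the 15 element names come from the Dublin Core list above.

```python
from dataclasses import dataclass, field

# The 15 Dublin Core element names listed above, in lower case.
DUBLIN_CORE = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


@dataclass
class MetadataRecord:
    """Hypothetical container grouping metadata by the three categories."""
    descriptive: dict = field(default_factory=dict)
    structural: dict = field(default_factory=dict)
    administrative: dict = field(default_factory=dict)

    def non_dublin_core(self):
        """Descriptive keys that fall outside the 15 Dublin Core elements."""
        return sorted(set(self.descriptive) - DUBLIN_CORE)


# Sample values are invented for illustration.
post = MetadataRecord(
    descriptive={
        "title": "Metadata: curse or cure for GDPR compliance?",
        "creator": "jdoe",
        "language": "en",
        "tags": ["gdpr", "privacy"],
        "comment_count": 4,
    },
    structural={"contains_image": "image-x.png", "part_of": "document-y"},
    administrative={"file_format": "text/html", "access": "public"},
)

print(post.non_dublin_core())  # ['comment_count', 'tags']
```

Note that even this small record carries a username in its creator field, which is exactly the kind of item that links a data object back to a person.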

The cure

Considering how much personal data human beings create, to what extent might intelligent metadata management be the cure? As Deloitte notes, “managing your metadata is a prerequisite for providing insight into data flows and related controls in your organisation”, which seems obvious considering that databases cannot be effectively queried or managed without metadata.

Indeed, metadata are an indispensable tool for companies to be able to even know, let alone compile, information held about an individual. Just as the work of linguistic psychologist James Pennebaker has shown that the otherwise invisible function words of a language (such as pronouns, articles and so forth) can illuminate intention, age and even personality traits of a speaker, metadata describe form and function of data objects in the informational multiverse.

As every data object may be described by many metadata objects, which themselves may be shared across many other data objects, each thread in the data-tapestry has the ability to reveal the greater tale that individual data object represents.

Without an intelligent and comprehensive metadata strategy, however, that story is incomplete at best and invisible at worst. With proper planning, a company will be able to meet the requirements of GDPR with no more than a few clicks, obtaining a complete record of every document, comment or interaction pertaining to a customer or employee at any given point in time. The rewards go well beyond compliance, however, as metadata together with unstructured data make up as much as 90 per cent of a company’s data landscape, known as ‘dark data’, or “the information assets organisations collect, process, and store during regular business activities, but generally fail to use for other purposes.”

Indeed, metadata management is vital for understanding and unlocking the value of dark data and may lead to powerful insights into business operations, whilst completely respecting regulations.
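The “few clicks” scenario presupposes that every data object is tagged with the person it concerns when it is created. The sketch below shows one minimal way such a lookup could work, assuming a central index keyed by a data-subject identifier. The MetadataIndex class, the system names and the identifiers are hypothetical, and a real deployment would still have to carry out the deletion in each source system.

```python
from collections import defaultdict


class MetadataIndex:
    """Hypothetical index mapping a data subject to every tagged data object."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def register(self, subject_id, system, object_id, kind):
        """Record that an object in some system concerns this person."""
        self._by_subject[subject_id].append(
            {"system": system, "object_id": object_id, "kind": kind}
        )

    def locate(self, subject_id):
        """Everything known to be held about one person (access request)."""
        return list(self._by_subject.get(subject_id, []))

    def erase(self, subject_id):
        """Drop the person from the index and return what must be deleted."""
        return self._by_subject.pop(subject_id, [])


index = MetadataIndex()
index.register("cust-42", "crm", "rec-1", "profile")
index.register("cust-42", "helpdesk", "tkt-9", "comment")
index.register("cust-7", "crm", "rec-2", "profile")

found = index.locate("cust-42")
print([(entry["system"], entry["object_id"]) for entry in found])

removed = index.erase("cust-42")
print(len(removed), index.locate("cust-42"))
```

The index only answers the question of where the data lives. Its value depends entirely on whether objects were registered consistently in the first place, which is the planning the article refers to.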

We must accept that the risk of identification is an omnipresent threat in everyday life, yet the very mechanisms that make digital opacity so challenging may also be the very devices that make GDPR more than just a punitive regulation, and an actual roadmap towards personal data control and the right to be forgotten.

While GDPR aims to protect the individual, it also provides a framework within which research can be conducted safely and with respect for every individual’s rights, facilitating responsible exploration over exploitation. We should remain vigilant, however, and remember that when crossing the road, it is always better to look left and right. Not everyone follows the rules, and the rules themselves are rarely comprehensive.

Sean MacNiven

Sean MacNiven is the global head of search and community within SAP’s product support organisation. Prior to this role, he led the public relations social media team, and the implementation and development of most of SAP’s key internal and external online news platforms. He is also an active researcher around communications, reputation and change management.