January 17, 2024

If You're Collecting Human Data, Respect Data Provenance

What you’ll learn:

Show me the summary


In the dynamic intersection of technology, health, and human behavior, insights lie in the intricacies of human data. Human data is unique, and to ensure high-quality, accurate, and trustworthy insights, organizations collecting and analyzing it must pay respect to data provenance. 

Keeping track of where data comes from, how it changes, and how it's used is crucial for data to be trusted to support decision-making. This helps us confirm that the data is accurate and reliable and helps improve data for future use, minimizing bias and rendering it well-suited for integration with artificial intelligence (AI).

What is data provenance?

Data provenance, in its essence, is the story of data – the origins, journey, and evolution of a data set - that helps to establish validity, track change, and provide transparency on how conclusions were reached. 

For human data, this narrative is particularly crucial. Whether collected through clinical assessments, wearables, or self-reported surveys, the provenance of human data adds layers of context and reliability, as well as legal and ethical considerations, to ensure your data is valuable now and in the future.

Because of its breadth, we will explain data provenance considerations utilizing a simple framework. These are the questions we should be able to ask of any human observation data set:

  1. Where did this data come from?
  2. What has been done to this data?
  3. What is the quality of this data?
  4. What are the ethical or legal implications of this data?

Where did this data come from?

Some of the most essential considerations for data provenance are understanding the data's origin, source, and general context about the data being generated. This information, the data about data, is collectively known as meta-data

Blog Examples of Meta Data (2)-1

Collecting and storing human data in its raw form is important but is more of a consideration for the completeness of data collection. Data provenance instead tracks the information that allows us to understand, for example…

  • What device or sensor was used?
  • What was the serial number?
  • Who or what entity generated or recorded the data?
  • When was the data collected?
  • And under what conditions?

This meta-data provides the critical information needed to trust and analyze the data and provide transparency later in the data's lifecycle.

Does it sound like overkill? We'll provide some simple examples of why this can be important.

The origin of data provides vital pieces of information. Human data analysis can be influenced by various factors such as hardware type, time of data capture, and other environmental conditions. For example, individual athletes, warfighters, or employees may use different devices (e.g., Garmin watch and Oura ring) to track sleep patterns, yielding different results. Even if the same device is used, data collected during the winter months may systematically differ from data collected in the summer! Capturing the temporal and environmental context ensures the data is interpreted and analyzed within the appropriate framework.

What has been done to this data?

With measurement devices like wearables consumerized, it is easy to forget that much of the human data we interact with is first captured by sensors that generate time-series data. This data is processed and transformed by software into results we can interpret, for example, Sleep, Readiness, or Fatigue Scores.  While the initial collection of raw data is a pivotal step, what happens next—how the data is processed, manipulated, and analyzed—shapes its usability and trustworthiness. For example…

  • What data processing has taken place?

This can include cleaning, filtering, normalization, or conversion to a standardized format. Any adjustments made to data to account for biases or outliers often occur during the initial processing of data.

heart rate

  • How are the different features or metrics extracted from the data?

Once the raw data is processed, relevant features, like max heart rate or total sleep duration, are often extracted. These algorithms are frequently updated iteratively, necessitating version management and documentation within our meta-data. 

  • Has the data been aggregated from multiple sources?

If so, it can be critical to understand what criteria were applied during data aggregation. For example, different sensors may have different expected levels of accuracy. In contrast, different subpopulations (e.g., sedentary older adults vs elite special forces operators) may require additional considerations for detecting outliers. 

This comprehensive documentation of data transformation ensures that stakeholders interacting with the data can understand how it changed from its initial collection until now and ensure it is analyzed and interpreted correctly. 

What is the quality of this data?

Ensuring the quality of human data involves assessing its accuracy and precision and addressing any known limitations or sources of error. Key considerations include:

  • How accurate is the data, considering the measurement tools and methods?
  • What is the reliability of the extracted features or metrics?
  • Have accuracy assessments or calibrations been performed on the data collection instruments?

Reliability Analysis

Metric reliability is a unique consideration with human data that becomes crucial in scenarios where minute changes in data are significant. For instance, monitoring blood glucose levels to detect subtle fluctuations that could impact medical decisions. Quality assurance is paramount because if users, researchers, or decision-makers cannot trust the data's origin, handling, and context, the entire foundation of data-driven endeavors becomes shaky.

Legal and ethical considerations?

Human data comes with ethical responsibilities and legal implications. The ethical considerations involve:

  • Are there any ethical considerations associated with the data collection process?
  • Was informed consent obtained from individuals providing the data?
  • How is sensitive or personally identifiable information handled and protected?

Data anonymization, encryption, and secure storage mechanisms become essential to prevent unintended disclosures that could compromise individual privacy. Reproducibility and transparency in data processing further contribute to the ethical use of human data. 

The Biggest Risk in Ignoring Data Provenance

Neglecting data provenance in human data introduces a profound risk: the erosion of trust.

Trust is the bedrock upon which any meaningful analysis, decision-making, or innovation stands. As organizations look to leverage human data to optimize employee health and performance, data provenance ensures that data is stored and analyzed correctly, regardless of which commercial device or algorithm version was used during data capture.

The devil is in the details, and the details of data provenance transform raw data into valuable, reliable insights. Ignoring these considerations not only jeopardizes the accuracy of current analyses but also hinders the potential for future advancements, especially in the realm of artificial intelligence, where the quality of input data directly influences the reliability of outcomes.

As technology advances and data becomes an increasingly valuable commodity, organizations must prioritize meticulously documenting data provenance. This safeguards the quality and reliability of current analyses and establishes a foundation of trust that is vital for the responsible and ethical use of human data in the ever-evolving landscape of technology and artificial intelligence.

Take Home

  1.  Data Provenance is Essential: The meticulous attention to data provenance is not an excessive detail; it's a fundamental aspect of responsible data handling.
  2. Trust is Paramount: Trust is the currency of data-driven endeavors. Neglecting data provenance compromises this trust and undermines the foundation of analyses and decision-making.
  3. Ethical Considerations are Non-Negotiable: Legal and ethical considerations are integral to data provenance. Obtaining informed consent, handling sensitive information responsibly, and ensuring transparency contribute to the ethical use of human data.
  4. Future-Proof Your Data: Meticulous documentation of data provenance isn't just for the present; it's an investment in the future. It paves the way for advancements, especially in the realm of artificial intelligence.

Other posts you might be interested in:

View All Posts