In the dynamic intersection of technology, health, and human behavior, insights lie in the intricacies of human data. Human data is unique, and to ensure high-quality, accurate, and trustworthy insights, organizations that collect and analyze it must pay close attention to data provenance.
Keeping track of where data comes from, how it changes, and how it is used is crucial if that data is to be trusted to support decision-making. Tracking provenance helps confirm that the data is accurate and reliable, and it improves the data for future use by minimizing bias and making it well suited for integration with artificial intelligence (AI).
For human data, this narrative is particularly crucial. Whether collected through clinical assessments, wearables, or self-reported surveys, the provenance of human data adds layers of context and reliability, along with legal and ethical considerations, that keep your data valuable now and in the future.
Because data provenance is such a broad topic, we will explain its key considerations using a simple framework: the questions we should be able to ask of any human observation data set.
Some of the most essential considerations for data provenance are the data's origin and source and the general context in which it was generated. This information, the data about the data, is collectively known as meta-data.
Collecting and storing human data in its raw form is important, but that is more a matter of completeness of data collection. Data provenance instead tracks the information that allows us to understand, for example…
This meta-data provides the critical information needed to trust and analyze the data, and it provides transparency later in the data's lifecycle.
Does it sound like overkill? We'll provide some simple examples of why this can be important.
The origin of data provides vital pieces of information. Human data analysis can be influenced by factors such as hardware type, time of data capture, and other environmental conditions. For example, individual athletes, warfighters, or employees may use different devices (e.g., a Garmin watch vs. an Oura ring) to track sleep patterns, yielding different results. Even if the same device is used, data collected during the winter months may systematically differ from data collected in the summer! Capturing the temporal and environmental context ensures the data is interpreted and analyzed within the appropriate framework.
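To make this concrete, here is a minimal sketch of how collection context might be attached to each batch of readings at capture time. The field names (device_model, firmware_version, location_hint) and the sample values are illustrative assumptions for this sketch, not a formal schema or any vendor's API.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Illustrative provenance record attached to a batch of wearable readings.
# Field names are assumptions for this sketch, not a standard.
@dataclass
class CollectionContext:
    subject_id: str        # pseudonymous identifier, never a real name
    device_model: str      # e.g., "Garmin watch" vs. "Oura ring"
    firmware_version: str  # hardware/software revision at capture time
    collected_at: str      # ISO-8601 timestamp in UTC
    location_hint: str     # coarse environmental context

def tag_readings(readings, context: CollectionContext):
    """Bundle raw sensor readings with the context they were captured in."""
    return {"context": asdict(context), "readings": readings}

batch = tag_readings(
    readings=[62, 61, 64],  # e.g., resting heart rate samples (bpm)
    context=CollectionContext(
        subject_id="athlete-017",
        device_model="Oura ring",
        firmware_version="3.2.1",
        collected_at=datetime.now(timezone.utc).isoformat(),
        location_hint="winter / indoor",
    ),
)
```

Even a lightweight record like this lets an analyst later ask whether two sleep scores came from the same device, firmware, and season before comparing them.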
Now that measurement devices like wearables have become consumer products, it is easy to forget that much of the human data we interact with is first captured by sensors that generate time-series data. This raw data is processed and transformed by software into results we can interpret, such as Sleep, Readiness, or Fatigue Scores. While the initial collection of raw data is a pivotal step, what happens next (how the data is processed, manipulated, and analyzed) shapes its usability and trustworthiness. For example…
This can include cleaning, filtering, normalization, or conversion to a standardized format. Adjustments made to account for biases or outliers also typically occur during this initial processing.
Once the raw data is processed, relevant features, such as maximum heart rate or total sleep duration, are extracted. The algorithms that extract these features are frequently updated, so version management and documentation should be part of our meta-data.
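As a hedged illustration of what documenting these steps might look like, the sketch below records each transformation alongside its result and stamps the extracted features with an algorithm version. The step names, the filtering range, and the version string are assumptions chosen for the example, not a prescribed pipeline.

```python
import statistics

def extract_features(raw_hr, transform_log):
    """Illustrative feature extraction that logs its own provenance."""
    # Step 1: drop physiologically implausible heart-rate samples (simple range filter).
    cleaned = [x for x in raw_hr if 30 <= x <= 220]
    transform_log.append({
        "step": "range_filter",
        "kept": len(cleaned),
        "dropped": len(raw_hr) - len(cleaned),
    })

    # Step 2: extract summary features and stamp them with the algorithm version,
    # so results stay interpretable after the extraction code is updated.
    features = {
        "max_hr": max(cleaned),
        "mean_hr": round(statistics.mean(cleaned), 1),
    }
    transform_log.append({"step": "feature_extraction", "feature_version": "1.4.0"})
    return features

log = []
print(extract_features([58, 61, 250, 63, 60], log))  # the 250 bpm sample is dropped
print(log)                                            # the log records what happened and why
```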
If the data has been aggregated, it can be critical to understand what criteria were applied during aggregation. For example, different sensors may have different expected levels of accuracy, while different subpopulations (e.g., sedentary older adults vs. elite special forces operators) may require additional considerations when detecting outliers.
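A brief sketch of the idea, with entirely illustrative cohort names and thresholds: the outlier criteria differ by subpopulation, and whichever criteria were actually applied are recorded next to the aggregate they produced.

```python
# Illustrative per-cohort outlier rules; the thresholds are assumptions, not guidance.
OUTLIER_RULES = {
    "sedentary_older_adults": {"max_daily_steps": 25_000},
    "elite_operators":        {"max_daily_steps": 60_000},
}

def aggregate_steps(daily_steps, cohort):
    """Aggregate step counts while recording which outlier rule was applied."""
    rule = OUTLIER_RULES[cohort]
    kept = [s for s in daily_steps if s <= rule["max_daily_steps"]]
    return {
        "cohort": cohort,
        "mean_daily_steps": round(sum(kept) / len(kept), 1),
        "outlier_rule_applied": rule,  # provenance: the criteria used for this aggregate
        "excluded_days": len(daily_steps) - len(kept),
    }

print(aggregate_steps([9_500, 11_200, 48_000], cohort="sedentary_older_adults"))
```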
This comprehensive documentation of data transformation lets stakeholders interacting with the data understand how it has changed since its initial collection and ensures it is analyzed and interpreted correctly.
Ensuring the quality of human data involves assessing its accuracy and precision and addressing any known limitations or sources of error.
One key consideration, metric reliability, is unique to human data and becomes crucial in scenarios where minute changes in the data are significant, for instance, monitoring blood glucose levels to detect subtle fluctuations that could affect medical decisions. Quality assurance is paramount: if users, researchers, or decision-makers cannot trust the data's origin, handling, and context, the entire foundation of data-driven endeavors becomes shaky.
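As a simple illustration, assuming a hypothetical sensor precision of ±0.2 mmol/L, a change smaller than the device's own precision should not be treated as a meaningful fluctuation; this is exactly the kind of known limitation that belongs in the meta-data.

```python
# Hypothetical precision value for illustration only; not drawn from any specific device.
SENSOR_PRECISION_MMOL_L = 0.2

def significant_change(previous, current, precision=SENSOR_PRECISION_MMOL_L):
    """Return True only when the delta exceeds what measurement error can explain."""
    return abs(current - previous) > precision

print(significant_change(5.4, 5.5))  # False: within measurement noise
print(significant_change(5.4, 6.1))  # True: larger than the precision band
```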
Human data comes with ethical responsibilities and legal implications, and chief among the ethical considerations is protecting the privacy of the individuals behind the data.
Data anonymization, encryption, and secure storage mechanisms become essential to prevent unintended disclosures that could compromise individual privacy. Reproducibility and transparency in data processing further contribute to the ethical use of human data.
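One common building block is pseudonymization: replacing direct identifiers with a keyed hash before analysis. The sketch below is only illustrative; the key shown is a placeholder, and a real deployment would also need key management, encryption at rest, access controls, and a legal basis for processing.

```python
import hashlib
import hmac

# Placeholder secret for the sketch; in practice this would live in a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(subject_identifier: str) -> str:
    """Replace a direct identifier with a stable keyed hash (pseudonym)."""
    digest = hmac.new(SECRET_KEY, subject_identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # short, stable token that stands in for the person

record = {"subject": pseudonymize("jane.doe@example.com"), "resting_hr": 58}
print(record)  # the analysis record carries no direct identifier
```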
Trust is the bedrock upon which any meaningful analysis, decision-making, or innovation stands. As organizations look to leverage human data to optimize employee health and performance, data provenance ensures that data is stored and analyzed correctly, regardless of which commercial device or algorithm version was used during data capture.
The devil is in the details, and the details of data provenance transform raw data into valuable, reliable insights. Ignoring these considerations not only jeopardizes the accuracy of current analyses but also hinders the potential for future advancements, especially in the realm of artificial intelligence, where the quality of input data directly influences the reliability of outcomes.
As technology advances and data becomes an increasingly valuable commodity, organizations must prioritize meticulously documenting data provenance. This safeguards the quality and reliability of current analyses and establishes a foundation of trust that is vital for the responsible and ethical use of human data in the ever-evolving landscape of technology and artificial intelligence.