All AI Agentic Roads Lead to Data Engineering

The data* in your organization is messier than your wildest imagination. Regardless of whether the data is tidy or messy, it will be the backbone of every AI agentic process. Business leaders who underestimate the massive gap between raw data and utility and the challenges in addressing this gap are destined to misallocate AI-related investments in their organization. The conversation surrounding agentic processes will always be one that revolves around how value is extracted from data.
Jaya Gupta reignited this conversation in her recently published article on the idea of context graphs, a new opportunity for data stores to capture the key elements that led to a decision for a specific event. The messy, unstructured notes across Slack channels, emails, and meetings that led to this event could get summarized in a novel data store, the context graph. Turning this unstructured information into a context graph would depend on AI agentic processes.
In “Long Live Systems of Record”, Jamin Ball emphasized that the fragility point for any agentic process is the source of truth, which means using the “right value from the right system at the right time”. He goes on to describe how the simple goal of calculating Annual Recurring Revenue (ARR) will net you different answers depending on which department you ask, which data sources they use, and which definition they think is appropriate. This reminds me of the common economics joke where if you “put 2 economists in a room and you’ll get 3 opinions”. A corollary to this for AI would be that if you ask 2 financial analysts to use an AI model to calculate your organization’s ARR in 2025, you will get a different answer every time you ask them to run the model.
The problem with using agentic processes to build a context graph, calculate a metric, or perform an analysis is that choosing the “right” value to use at each step is a subjective choice. While each input is subjective, some values are inherently better than others. In this new AI era, agentic processes automate what can be verified, and subjective choices cannot be verified without human intervention and/or explicit instructions at every step of the way (e.g., deterministic software).
At this point, a non-technical professional or business leader with some AI familiarity will propose that adding the necessary context will be sufficient for AI to automate the process. This could not be further from the truth because the gap between raw data and utility is much larger than your wildest imagination.
In a typical scenario, we can assume that the agentic process has access to the functional specifications, technical specifications, the list of data sources, and general prompt instructions. Some common examples of challenges that might arise from an ARR calculation with this context include the following:
- The schema for the data source is outdated or incomplete.
- There are 10+ different columns across 10 different tables that could be used for price when calculating ARR. Which one do we use?
- How will the automation handle NULL values in each data attribute?
- How will the agentic process know if the SQL view used in the data source was built correctly?
- The specifications failed to mention that the join keys changed over time.
- Sales data is captured across several data sources. How do we identify situations where a sales record is entered twice into our system?
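To make the ARR challenges above concrete, here is a minimal sketch of how three of those choices (which price column, how to treat NULLs, whether to deduplicate) change the answer. The rows, column names, and policies are all invented for illustration; they do not come from any real system, and a real ARR definition would involve many more decisions.

```python
from itertools import product

# Hypothetical subscription rows with monthly prices. Every name and value is an assumption.
ROWS = [
    {"id": 1, "customer": "acme", "list_price": 100.0, "net_price": 90.0},
    {"id": 2, "customer": "acme", "list_price": 100.0, "net_price": 90.0},  # possible double entry
    {"id": 3, "customer": "globex", "list_price": 200.0, "net_price": None},  # missing net price
]


def arr(rows, price_col, null_policy, dedupe):
    """Annualize monthly prices under one explicit set of choices."""
    if dedupe:
        # Treat rows with the same customer and prices as one record (one possible rule).
        seen, unique = set(), []
        for r in rows:
            key = (r["customer"], r["list_price"], r["net_price"])
            if key not in seen:
                seen.add(key)
                unique.append(r)
        rows = unique
    total = 0.0
    for r in rows:
        price = r[price_col]
        if price is None:
            if null_policy == "skip":
                continue  # drop the row entirely
            price = r["list_price"]  # "fallback" policy: use the list price instead
        total += price * 12
    return total


def all_answers(rows):
    """Compute ARR for every combination of the three choices."""
    choices = product(["list_price", "net_price"], ["skip", "fallback"], [False, True])
    return {c: arr(rows, *c) for c in choices}


if __name__ == "__main__":
    for choice, value in all_answers(ROWS).items():
        print(choice, value)
```

With only three rows and three binary choices, the eight combinations already produce six different ARR figures. None of them is verifiably "correct" until someone decides which column, NULL policy, and deduplication rule the business actually means.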
If we mirror a similar scenario for building a context graph that aims to capture the decision traces that led to a sale, there will be similar challenges with the data such as:
- Which sales were impacted by this specific meeting? There is no obvious “semantic join” to match products sold to salespersons involved at the specific time when the event occurred.
- Which meetings were omitted from the record because they were in-person or outside normal communication channels?
- How many notes are biased because the salesperson drew the wrong conclusions about why the sale was made?
- How will the agentic process know which emails are associated with a specific sale?
- Does a “sale” mean the same thing in every email, message, and meeting? A “sale” could be past tense, singular or multiple, future tense, or present tense.
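The matching problem in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not a proposed method: the records, the field names, and the rule "same customer, sale closed within 30 days after the email" are all hypothetical stand-ins for whatever join logic a team might try first.

```python
from datetime import date, timedelta

# Hypothetical records. Field names and values are invented for illustration.
SALES = [
    {"sale_id": "S1", "customer": "acme", "closed": date(2025, 3, 10)},
    {"sale_id": "S2", "customer": "acme", "closed": date(2025, 3, 20)},
]
EMAILS = [
    {"email_id": "E1", "customer": "acme", "sent": date(2025, 3, 8)},
    {"email_id": "E2", "customer": "globex", "sent": date(2025, 3, 9)},
]


def candidate_sales(email, sales, window_days=30):
    """Return ids of sales for the same customer that closed within window_days after the email."""
    window = timedelta(days=window_days)
    return [
        s["sale_id"]
        for s in sales
        if s["customer"] == email["customer"]
        and timedelta(0) <= s["closed"] - email["sent"] <= window
    ]


def classify(email, sales):
    """Label an email by how many sales it could plausibly belong to."""
    n = len(candidate_sales(email, sales))
    if n == 0:
        return "unmatched"
    return "matched" if n == 1 else "ambiguous"


if __name__ == "__main__":
    for e in EMAILS:
        print(e["email_id"], classify(e, SALES), candidate_sales(e, SALES))
```

Here the first email could belong to either acme sale and the second belongs to none. The rule itself cannot say which link is right; a person who knows the deal has to decide, or the rule has to be rewritten with knowledge that is not in the data.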
There are countless situations that a data professional will encounter where the information needed to build a context graph, calculate a metric, or perform an analysis is ambiguous, biased, omitted, or only exists in someone’s head. Providing all of the necessary context for an agentic process would require an army of full-time employees to maintain rigorous, explicit documentation to feed it. Even with all of the information, the agentic process would provide inconsistent answers each time you ask it to calculate ARR for 2025 (or build a context graph for a sale record) because it is a probabilistic process at its foundation.
Agentic processes will never have enough context to automate processes filled with subjective choices. Subjectivity destroys an agentic process’s ability to verify, troubleshoot, and retry a process until it gets the correct answer. If there is any hope of building a verifiable methodology within a repeatable pipeline for an agentic process, it will take an army of data engineers** to ensure the automation makes the right choices along the way.
~ The Data Generalist
Data Science Career Advisor
*Data includes structured/tabular data, metadata, and unstructured data (e.g., text, images, video).
**Data engineers will need help from AI engineers, software engineers, domain experts and many other colleagues.