A Little More Conversation, A Little Less Action — A Case Against Premature Data Integration


When I talk to [large] organisations that have not yet properly started with Data Science (DS) and Machine Learning (ML), they often tell me that they have to run a data integration project first, because “…all the data is scattered across the organisation, hidden in silos and packed away in odd formats on obscure servers run by different departments.”

While it may be true that the data is hard to get at, running a large data integration project before embarking on the ML part is usually a bad idea. This is because you would be integrating data without knowing its use — the chances that the data will be fit for purpose in some future ML use case are slim, at best.

In this article, I discuss some of the most important drivers and pitfalls for this kind of integration project, and instead suggest an approach that focuses on optimising value for money in the integration efforts. The short answer to the challenge is [spoiler alert…] to integrate data on a use-case-per-use-case basis, working backwards from the use case to identify exactly the data you need.

A desire for clean and tidy data

It is easy to understand the urge to do data integration before starting on the data science and machine learning challenges. Below, I list four drivers that I often meet. The list is not exhaustive, but it covers the most important motivations as I see them. We will then go through each driver, discussing its merits, pitfalls and alternatives.

  1. Cracking out AI/ML use cases is difficult, and even more so if you don’t know what data is available, and of what quality.
  2. Snooping out hidden-away data and integrating the data into a platform seems like a more concrete and manageable problem to solve.
  3. Many organisations have a culture of not sharing data, and focusing on data sharing and integration first helps to change this.
  4. From history, we know that many ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges prior to the ML project may help remove these barriers.

There are of course other drivers for data integration projects, such as “single source of truth”, “Customer 360”, FOMO, and the basic urge to “do something now!”. While important drivers for data integration initiatives, I don’t see them as key for ML-projects, and therefore will not discuss these any further in this post.

1. Cracking out AI/ML use cases is difficult,

… and even more so if you don’t know what data is available, and of what quality. This is, in fact, a real Catch-22 problem: you can’t do machine learning without the right data in place, but if you don’t know what data you have, identifying the potential of machine learning is essentially impossible too. Indeed, it is one of the main challenges in getting started with machine learning in the first place [see “Nobody puts AI in a corner!” for more on that]. But the problem is not solved most effectively by running an initial data discovery and integration project. It is better solved by an awesome methodology that is well proven in use and applies to many different problem areas. It is called talking together. Since this, to a large extent, is the answer to several of the driving urges, we shall spend a few lines on this topic now.

The value of having people talking to each other cannot be overestimated. It is the only way to make a team work, and to make teams across an organisation work together. It is also a very efficient carrier of information about intricate details regarding data, products, services or other contraptions that are made by one team but used by someone else. Compare “Talking Together” to its antithesis in this context: Produce Comprehensive Documentation. Producing self-contained documentation is difficult and expensive. For a dataset to be usable by a third party solely by consulting the documentation, the documentation has to be complete. It must cover the full context in which the data must be seen: How was the data captured? What is the generating process? What transformations have been applied to bring the data into its current form? What is the interpretation of the different fields/columns, and how do they relate? What are the data types and value ranges, and how should one deal with null values? Are there access restrictions or usage restrictions on the data? Privacy concerns? The list goes on and on. And as the dataset changes, the documentation must change too.

Now, if the data is an independent, commercial data product that you provide to customers, comprehensive documentation may be the way to go. If you are OpenWeatherMap, you want your weather data APIs to be well documented — these are true data products, and OpenWeatherMap has built a business out of serving real-time and historical weather data through those APIs. Also, if you are a large organisation and a team finds that it spends so much time talking to people that it would indeed pay off to produce comprehensive documentation — then you do that. But most internal data products have one or two internal consumers to begin with, and then comprehensive documentation doesn’t pay off.

On a general note, Talking Together is actually a key factor for succeeding with a transition to AI and Machine Learning altogether, as I write about in “Nobody puts AI in a corner!”. And it is a cornerstone of agile software development. Remember the Agile Manifesto? It values individuals and interactions, and working software over comprehensive documentation. So there you have it. Talk Together.

Also, not only does documentation incur a cost, but you also run the risk of raising the barrier to people talking together (“read the $#@!!?% documentation”).

Now, just to be clear on one thing: I am not against documentation. Documentation is super important. But, as we discuss in the next section, don’t waste time on writing documentation that is not needed.

2. Snooping out hidden-away data and integrating the data into a platform seems like a much more concrete and manageable problem to solve.

Yes, it is. However, the downside of doing this before identifying the ML use case is that you only solve the “integrating data in a platform” problem. You don’t solve the “gather useful data for the machine learning use case” problem, which is what you actually want to do. This is the flip side of the Catch-22 from the previous section: if you don’t know the ML use case, then you don’t know what data you need to integrate. Also, integrating data for its own sake, without the data users being part of the team, requires very good documentation, which we have already covered.

To look deeper into why data integration without the ML-use case in view is premature, we can look at how [successful] machine learning projects are run. At a high level, the output of a machine learning project is a kind of oracle (the algorithm) that answers questions for you. “What product should we recommend for this user?”, or “When is this motor due for maintenance?”. If we stick with the latter, the algorithm would be a function mapping the motor in question to a date, namely the due date for maintenance. If this service is provided through an API, the input can be {“motor-id” : 42} and the output can be {“latest maintenance” : “March 9th 2026”}. Now, this prediction is done by some “system”, so a richer picture of the solution could be something along the lines of

Image by the author.

The key here is that the motor-id is used to obtain further information about that motor from the data mesh in order to make a robust prediction. The required data is illustrated by the feature vector in the illustration. And exactly which data you need in order to make that prediction is difficult to know before the ML project has started. Indeed, the very precipice on which every ML project balances is whether the project succeeds in figuring out exactly what information is required to answer the question well. And this is done by trial and error in the course of the ML project (we call it hypothesis testing and feature extraction and experiments and other fancy things, but it’s just structured trial and error).
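To make this concrete, here is a minimal sketch (in Python) of what such a prediction service could look like. The function names and the features (operating hours, temperature, vibration) are invented for illustration, and the data mesh lookup is stubbed out; the point is only that the service has to resolve the motor-id into a feature vector before the model can say anything useful.

```python
from datetime import date, timedelta

def fetch_features(motor_id: int) -> dict:
    """Hypothetical lookup of the motor's feature vector from the data mesh (stubbed here)."""
    return {"operating_hours": 12_400, "avg_temperature": 71.3, "vibration_rms": 0.42}

def predict_days_until_maintenance(features: dict) -> int:
    """Stand-in for the trained model: maps a feature vector to days until maintenance."""
    return 180  # a real model would compute this from the features

def maintenance_due(request: dict) -> dict:
    """API handler: {"motor-id": 42} -> {"latest maintenance": "2026-03-09"}."""
    features = fetch_features(request["motor-id"])
    due_date = date.today() + timedelta(days=predict_days_until_maintenance(features))
    return {"latest maintenance": due_date.isoformat()}

print(maintenance_due({"motor-id": 42}))
```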

If you integrate your motor data into the platform without these experiments, how are you going to know what data you need to integrate? Surely, you could integrate everything, and keep updating the platform with all the data (and documentation) until the end of time. But most likely, only a small amount of that data is required to solve the prediction problem. Unused data is waste: both the effort invested in integrating and documenting the data, and the storage and maintenance cost for all time to come. According to the Pareto rule, you can expect roughly 20% of the data to provide 80% of the value. But it is hard to know which 20% that is before you know the ML use case, and before you have run the experiments.
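For illustration, this is roughly what one such experiment can look like. The sketch below builds a synthetic dataset with made-up motor features, only two of which actually drive the target, and uses scikit-learn’s permutation importance to reveal which columns carry predictive value; the library calls are real, everything else is invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Made-up candidate features; in practice this is whatever you pulled in for the experiment.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "operating_hours": rng.uniform(0, 20_000, n),
    "avg_temperature": rng.normal(70, 5, n),
    "vibration_rms": rng.uniform(0, 1, n),
    "paint_colour_code": rng.integers(0, 10, n),  # probably useless, but we don't know yet
})
# Synthetic target: days until maintenance, driven by only two of the columns.
y = 365 - 0.015 * X["operating_hours"] - 120 * X["vibration_rms"] + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Rank features by how much shuffling them hurts the model on held-out data.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:>20}: {score:.3f}")
```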

This is also a caution against just “storing data for the sake of it”. I’ve seen many data hoarding initiatives, where decrees have been passed down from top management about saving away all the data possible, because data is the new oil/gold/cash/currency/etc. For a concrete example: a few years back I met with an old colleague, a product owner in the mechanical industry, whose organisation had started collecting all sorts of time series data about their machinery some time earlier. One day, they came up with a killer ML use case where they wanted to take advantage of how distributed events across the industrial plant were related. But, alas, when they looked at their time series data, they realised that the distributed machine instances did not have sufficiently synchronised clocks, leading to non-correlatable time stamps, so the planned cross-correlation between time series was not feasible after all. A bummer, but a classic example of what happens when you don’t know the use case you are gathering data for.

3. Many organisations have a culture of not sharing data, and focusing on data sharing and integration first helps to change this culture.

The first part of this sentence is true; there is no doubt that many good initiatives are blocked by cultural issues in the organisation: power struggles, data ownership, reluctance to share, siloing, and so on. The question is whether an organisation-wide data integration effort is going to change this. If someone is reluctant to share their data, a decree from above stating that if you share your data, the world will become a better place is probably too abstract to change that attitude.

However, if you interact with this group, include them in the work and show them how their data can help the organisation improve, you are much more likely to win their hearts. Because attitudes are about feelings, and the best way to deal with differences of this kind is (believe it or not) to talk together. The team providing the data has a need to shine, too. And if they are not invited into the project, they will feel forgotten and ignored when honour and glory rain down on the ML/product team that delivered some new and fancy solution to a long-standing problem.

Remember that the data feeding into the ML algorithms is part of the product stack — if you don’t include the data-owning team in the development, you are not running full stack. (An important reason why full stack teams are better than many alternatives is that inside teams, people talk together. And bringing all the players in the value chain into the [full stack] team gets them talking together.)

I have been in a number of organisations, and many times I have run into collaboration problems due to cultural differences of this kind. Never have I seen such barriers drop because of a decree from the C-suite. Middle management may buy into it, but the rank-and-file employees mostly just give it a scornful look and carry on as before. However, I have been in many teams where we solved this problem by inviting the other party into the fold, and talking about it, together.

4. From history, we know that many DS/ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges prior to the ML project may help remove these barriers.

While the previous driver is about human behaviour, I place this one in the category of technical states of affairs. When data is integrated into the platform, it should be safely stored and easy to obtain and use in the right way. For a large organisation, having a strategy and policies for data integration is key. But there is a difference between rigging an infrastructure for data integration, together with a minimum of processes around it, and scavenging through the enterprise to integrate a shitload of data. Yes, you need the platform and the policies, but you don’t integrate data before you know that you need it. And when you do this step by step, you can benefit from iterative development of the data platform too.

A basic platform infrastructure should also come with the necessary policies to ensure compliance with regulations, privacy and other concerns. These are the concerns that come with being an organisation that uses machine learning and artificial intelligence to make decisions, and that trains on data that may or may not be generated by individuals who may or may not have given their consent to different uses of that data.

But to circle back to the first driver, about not knowing what data the ML projects may get their hands on — you still need something to help people navigate the data residing in various parts of the organisation. And if we are not to run an integration project first, what do we do? Establish a catalogue where departments and teams are rewarded for adding a block of text about the data they are sitting on. Just a brief description: what kind of data it is, what it is about, who the stewards of the data are, and perhaps a guess at what it can be used for. Put this into a text database or similar structure, and make it searchable. Or, even better, let the database back an AI assistant that allows you to do proper semantic searches through the descriptions of the datasets. As time (and projects) pass by, the catalogue can be extended with further information and documentation as data is integrated into the platform and documentation is created. And if someone queries a department about their dataset, you may just as well shove both the question and the answer into the catalogue database too.

Such a database, containing mostly free text, is a much cheaper alternative to a readily integrated data platform with comprehensive documentation. You just need the different data-owning teams and departments to dump some of their documentation into the database. They may even use generative AI to produce the documentation (allowing them to check off that OKR too 🙉🙈🙊).
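As a sketch of how little is needed to get started, the snippet below keeps a handful of invented dataset descriptions in memory and ranks them against a free-text query using TF-IDF similarity. A real setup would keep the descriptions in a database, and the vectoriser could be swapped for an embedding model (or an AI assistant on top) to get proper semantic search; the scikit-learn calls are real, the catalogue content is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented catalogue entries: one free-text description per dataset,
# as provided by the owning team.
catalogue = {
    "motor-telemetry": "Time series of temperature, vibration and load per motor, "
                       "collected every minute. Owned by the plant operations team.",
    "maintenance-log": "Historical maintenance work orders with dates, parts replaced "
                       "and technician notes. Owned by the maintenance department.",
    "customer-orders": "Orders, invoices and delivery dates from the ERP system. "
                       "Owned by the sales back office.",
}

names = list(catalogue)
vectoriser = TfidfVectorizer(stop_words="english")
matrix = vectoriser.fit_transform(catalogue.values())

def search(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    """Return the datasets whose descriptions best match the free-text query."""
    scores = cosine_similarity(vectoriser.transform([query]), matrix)[0]
    return sorted(zip(names, scores), key=lambda t: -t[1])[:top_k]

print(search("motor vibration and maintenance history"))
```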

5. Summing up

To sum up: in the context of ML projects, the data integration effort should be approached as follows:

  1. Establish a data platform/data mesh strategy, together with the minimally required infrastructure and policies.
  2. Create a catalogue of dataset descriptions that can be queried by using free text search, as a low-cost data discovery tool. Incentivise the different groups to populate the database through use of KPIs or other mechanisms.
  3. Integrate data into the platform or mesh on a use-case-per-use-case basis, working backwards from the use case and the ML experiments, making sure the integrated data is both necessary and sufficient for its intended use.
  4. Solve cultural and cross-departmental (or silo) barriers by including the relevant resources in the ML project’s full stack team, and…
  5. Talk Together

Good luck!

Regards
-daniel-
