Ivory Tower Notes: The Problem | Towards Data Science


months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? If so, or even if not, and you are only starting with the data science or AI field, welcome to my first Ivory Tower Note, where I will address this topic. 


The term “Ivory Tower” is a metaphor for a situation in which someone is isolated from the practical realities of everyday life. In academia, the term often refers to researchers who engage deeply in theoretical pursuits and remain distant from the realities that practitioners face outside academia.

As a former researcher, I wrote a short series of posts from my old Ivory Tower notes — the notes before the LLM era.

Scary, I know. I am writing this to manage expectations and the question, “Why ever did you do things this way?” — “Because no LLM told me how to do otherwise 10+ years ago.”

That’s why my notes contain “legacy” topics such as data mining, machine learning, multi-criteria decision-making, and (sometimes) human interactions, airplanes ✈️ and art.

Nonetheless, whenever there is an opportunity, I will map my “old” knowledge to generative AI advances and explain how I applied it to datasets beyond the Ivory Tower.

Welcome to post #1…


How every Machine Learning and AI journey starts

 — It starts with a problem. 

For you, this is usually “the” problem because you need to live with it for months or, in the case of research, years

With “the” problem, I am addressing the business problem you don’t fully understand or know how to solve at first. 

An even worse scenario is when you think you fully understand and know how to solve it quickly. This then creates only more problems that are again only yours to solve. But more about this in the upcoming sections. 

So, what’s “the” problem about?

Causa: It’s mostly about not managing or leveraging resources properly —  workforce, equipment, money, or time. 

Ratio: It’s usually about generating business value, which can span from improved accuracy, increased productivity, cost savings, revenue gains, faster reaction, decision, planning, delivery or turnaround times. 

Veritas: It’s always about finding a solution that relies and is hidden somewhere in the existing dataset. 

Or, more than one dataset that someone labelled as “the one”, and that’s waiting for you to solve the problem. Because datasets follow and are created from technical or business process logs, “there has to be a solution lying somewhere within them.

Ah, if only it were so easy.

Avoiding a different chain of thought again, the point is you will need to:

1 — Understand the problem fully,
2 — If not given, find the dataset “behind” it, and 
3 — Create a methodology to get to the solution that will generate business value from it. 

On this path, you will be tracked and measured, and time will not be on your side to deliver the solution that will solve “the universe equation.” 

That’s why you will need to approach the problem methodologically, drill down to smaller problems first, and focus entirely on them because they are the root cause of the overall problem. 

That’s why it’s good to learn how to…

Think like a Data Scientist.

Returning to the problem itself, let’s imagine that you are a tourist lost somewhere in the big museum, and you want to figure out where you are. What you do next is walk to the closest info map on the floor, which will show your current location. 

At this moment, in front of you, you see something like this: 

Process. Image by Author, inspired by Microsoft Learn

The next thing you might tell yourself is, “I want to get to Frida Kahlo’s painting.” (Note: These are the insights you want to get.)

Because your goal is to see this one painting that brought you miles away from your home and now sits two floors below, you head straight to the second floor. Beforehand, you memorized the shortest path to reach your goal. (Note: This is the initial data collection and discovery phase.)

However, along the way, you stumble upon some obstacles — the elevator is shut down for renovation, so you have to use the stairs. The museum paintings were reordered just two days ago, and the info plans didn’t reflect the changes, so the path you had in mind to get to the painting is not accurate. 

Then you find yourself wandering around the third floor already, asking quietly again, “How do I get out of this labyrinth and get to my painting faster?

While you don’t know the answer, you ask the museum staff on the third floor to help you out, and you start collecting the new data to get the correct route to your painting. (Note: This is a new data collection and discovery phase.)

Nonetheless, once you get to the second floor, you get lost again, but what you do next is start noticing a pattern in how the paintings have been ordered chronologically and thematically to group the artists whose styles overlap, thus giving you an indication of where to go to find your painting. (Note: This is a modelling phase overlapped with the enrichment phase from the dataset you collected during school days — your art knowledge.)

Finally, after adapting the pattern analysis and recalling the collected inputs on the museum route, you arrive in front of the painting you had been planning to see since booking your flight a few months ago. 

What I described now is how you approach data science and, nowadays, generative AI problems. You always start with the end goal in mind and ask yourself:

“What is the expected outcome I want or need to get from this?”

Then you start planning from this question backwards. The example above started with requesting holidays, booking flights, arranging accommodation, traveling to a destination, buying museum tickets, wandering around in a museum, and then seeing the painting you’ve been reading about for ages. 

Of course, there is more to it, and this process should be approached differently if you need to solve someone else’s problem, which is a bit more complex than locating the painting in the museum. 

In this case, you have to…

Ask the “good” questions.

To do this, let’s define what a good question means [1]: 

A good data science question must be concrete, tractable, and answerable. Your question works well if it naturally points to a feasible approach for your project. If your question is too vague to suggest what data you need, it won’t effectively guide your work.

Formulating good questions keeps you on track so you don’t get lost in the data that should be used to get to the specific problem solution, or you don’t end up solving the wrong problem.

Going into more detail, good questions will help identify gaps in reasoning, avoid faulty premises, and create alternative scenarios in case things do go south (which almost always happens)👇🏼.

Image created by Author after analyzing “Chapter 2. Setting goals by asking good questions” from “Think Like a Data Scientist” book [2]

From the above-presented diagram, you understand how good questions, first and foremost, need to support concrete assumptions. This means they need to be formulated in a way that your premises are clear and ensure they can be tested without mixing up facts with opinions.

Good questions produce answers that move you closer to your goal, whether through confirming hypotheses, providing new insights, or eliminating wrong paths. They are measurable, and with this, they connect to project goals because they are formulated with consideration of what’s possible, valuable, and efficient [2].

Good questions are answerable with available data, considering current data relevance and limitations. 

Last but not least, good questions anticipate obstacles. If something is certain in data science, this is the uncertainty, so having backup plans when things don’t work as expected is important to produce results for your project.

Let’s exemplify this with one use case of an airline company that has a challenge with increasing its fleet availability due to unplanned technical groundings (UTG).

These unexpected maintenance events disrupt flights and cost the company significant money. Because of this, executives decided to react to the problem and call in a data scientist (you) to help them improve aircraft availability.

Now, if this would be the first data science task you ever got, you would maybe start an investigation by asking:

“How can we eliminate all unplanned maintenance events?”

You understand how this question is an example of the wrong or “poor” one because:

  • It is not realistic: It includes every possible defect, both small and big, into one impossible goal of “zero operational interruptions”.
  • It doesn’t hold a measure of success: There’s no concrete metric to show progress, and if you’re not at zero, you’re at “failure.”
  • It is not data-driven: The question didn’t cover which data is recorded before delays occur, and how the aircraft unavailability is measured and reported from it.

So, instead of this vague question, you would probably ask a set of targeted questions:

  1. Which aircraft (sub)system is most critical to flight disruptions?
    (Concrete, specific, answerable) This question narrows down your scope, focusing on only one or two specific (sub) systems affecting most delays.
  2. What constitutes “critical downtime” from an operational perspective?
    (Valuable, ties to business goals) If the airline (or regulatory body) doesn’t define how many minutes of unscheduled downtime matter for schedule disruptions, you might waste effort solving less urgent issues.
  3. Which data sources capture the root causes, and how can we fuse them?
    (Manageable, narrows the scope of the project further) This clarifies which data sources one would need to find the problem solution.

With these sharper questions, you will drill down to the real problem:

  • Not all delays weigh the same in cost or impact. The “correct” data science problem is to predict critical subsystem failures that lead to operationally costly interruptions so maintenance crews can prioritize them.

That’s why…

Defining the problem determines every step after. 

It’s the foundation upon which your data, modelling, and evaluation phases are built 👇🏼.

Image created by Author after analyzing and overlapping different images from “Chapter 2. Setting goals by asking good questions, Think Like a Data Scientist” book [2]

It means you are clarifying the project’s objectives, constraints, and scope; you need to articulate the ultimate goal first and, except for asking “What’s the expected outcome I want or need to get from this?”, ask as well: 

What would success look like and how can we measure it?

From there, drill down to (possible) next-level questions that you (I) have learned from the Ivory Tower days:
 — History questions: “Has anyone tried to solve this before? What happened? What is still missing?”
 —  Context questions: “Who is affected by this problem and how? How are they partially resolving it now? Which sources, methods, and tools are they using now, and can they still be reused in the new models?”
 — Impact Questions: “What happens if we don’t solve this? What changes if we do? Is there a value we can create by default? How much will this approach cost?”
Assumption Questions: “What are we taking for granted that might not be true (especially when it comes to data and stakeholders’ ideas)?”
 — ….

Then, do this in the loop and always “ask, ask again, and don’t stop asking” questions so you can drill down and understand which data and analysis are needed and what the ground problem is. 

This is the evergreen knowledge you can apply nowadays, too, when deciding if your problem is of a predictive or generative nature. 

(More about this in some other note where I will explain how problematic it is trying to solve the problem with the models that have never seen — or have never been trained on — similar problems before.)

Now, going back to memory lane…

I want to add one important note: I have learned from late nights in the Ivory Tower that no amount of data or data science knowledge can save you if you’re solving the wrong problem and trying to get the solution (answer) from a question that was simply wrong and vague. 

When you have a problem on hand, do not rush into assumptions or building the models without understanding what you need to do (Festina lente)

In addition, prepare yourself for unexpected situations and do a proper investigation with your stakeholders and domain experts because their patience will be limited, too. 

With this, I want to say that the “real art” of being successful in data projects is knowing precisely what the problem is, figuring out if it can be solved in the first place, and then coming up with the “how” part. 

You get there by learning to ask good questions.

If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.


Thank you for reading, and stay tuned for the next Ivory Tower note.

If you found this post valuable, feel free to share it with your network. 👏

Connect for more stories on Medium ✍️ and LinkedIn 🖇️.


References: 

[1] DS4Humans, Backwards Design, accessed: April 5th 2025, https://ds4humans.com/40_in_practice/05_backwards_design.html#defining-a-good-question

[2] Godsey, B. (2017), Think Like a Data Scientist: Tackle the data science process step-by-step, Manning Publications.

Recent Articles

Related Stories

Leave A Reply

Please enter your comment!
Please enter your name here