Shape the future of IBM watsonx Orchestrate

Start by searching and reviewing ideas others have posted, and add a comment (private if needed), vote, or subscribe to updates on them if they matter to you.

If you can't find what you are looking for, create a new idea:

  1. Stick to one feature enhancement per idea.

  2. Add as much detail as possible, including use case, examples & screenshots (put anything confidential in the Hidden details field or a private comment).

  3. Explain the business impact and the timeline of the project being affected.

[For IBMers] Add customer/project name, details & timeline in Hidden details field or a private comment (only visible to you and the IBM product team).

This all helps to scope and prioritize your idea among many other good ones. Thank you for your feedback!

Specific links you will want to bookmark for future use
Learn more about IBM watsonx Orchestrate - Use this site to find out additional information and details about the product.
Welcome to the IBM Ideas Portal (https://www.ibm.com/ideas) - Use this site to find out additional information and details about the IBM Ideas process and statuses.
IBM Unified Ideas Portal (https://ideas.ibm.com) - Use this site to view all of your ideas, create new ideas for any IBM product, or search for ideas across all of IBM.
ideasibm@us.ibm.com - Use this email to suggest enhancements to the Ideas process or request help from IBM for submitting your Ideas.

Status Needs more information
Created by Guest
Created on Aug 21, 2025

Multi-Agent System to Automate the Selection and Evaluation of Machine Learning and Deep Learning Models

The proposal is to create an ecosystem of agents in watsonx Orchestrate capable of working collaboratively to analyze structured and unstructured data, identify business objectives, and automatically select the best Machine Learning or Deep Learning models.

The workflow would be composed of the following specialized agents:

Database Analysis Agent

Scans the database.

Generates a full report of the structure, table relationships, data types, and their possible business purpose.

Identifies key tables and critical relationships.

Data Analyst Agent

Interprets the report from the database agent.

Determines which types of information could provide value (e.g., transactions, historical records, customer metrics, inventory data).

Proposes prediction objectives, such as fraud detection, sales forecasting, revenue prediction, customer churn analysis, or inventory management.

Dataset Generator Agent

Builds datasets based on the prioritized information.

Cleans, normalizes, and transforms data according to training requirements.

Documents data preparation processes for traceability.

AutoML Training Agent

Uses an engine like H2O AutoML to generate and evaluate multiple pipelines with different models and configurations.

Selects the best-performing models according to predefined metrics (accuracy, recall, F1, etc.).

Coding / Preprocessing Agent

If errors or inconsistencies are detected in the data, this agent corrects, filters, or reformats the datasets.

Resubmits them to the AutoML training agent for model regeneration.

Visualization and Reporting Agent

Produces comparative charts of models, metrics, and projections.

Creates dashboards and executive reports showing the viability of the trained models and their applications in different business areas.
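The division of labor above can be sketched as a sequential pipeline in which each agent consumes the previous agent's output. The agent names and payloads below are illustrative stand-ins for the proposed design, not an actual Orchestrate API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Agent:
    name: str
    run: Callable[[Any], Any]  # consumes the previous agent's artifact

# Hypothetical stand-ins for the specialized agents described above.
pipeline = [
    Agent("DatabaseAnalysisAgent", lambda db: {"tables": ["sales"], "source": db}),
    Agent("DataAnalystAgent", lambda report: {**report, "objective": "sales_forecast"}),
    Agent("DatasetGeneratorAgent", lambda plan: {**plan, "dataset": "clean.csv"}),
    Agent("AutoMLTrainingAgent", lambda ds: {**ds, "best_model": "GBM"}),
    Agent("VisualizationAgent", lambda result: {**result, "report": "dashboard.html"}),
]

def run_pipeline(initial: Any) -> Any:
    artifact = initial
    for agent in pipeline:
        artifact = agent.run(artifact)  # hand off to the next specialist
    return artifact

result = run_pipeline("postgres://crm")
```

Each stage only needs to understand the artifact produced by its predecessor, which is what makes the specialized-agent decomposition tractable.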

Expected Benefits

Extreme automation of the ML lifecycle, from data exploration to model selection.

Accelerated time-to-value, reducing the need for manual intervention from experts at every stage.

AI democratization, allowing non-technical users to obtain insights and ready-to-use models for business problems.

Scalability, with the ability to integrate multiple data sources and generate prediction objectives in parallel.

Increased competitiveness, positioning watsonx Orchestrate as a unique AI orchestration platform in the market.

With this approach, watsonx Orchestrate would evolve into a platform where artificial intelligence orchestrates itself, enabling businesses to transform raw data into intelligent, actionable decisions with minimal friction.

Idea priority Medium
  • Guest
    Oct 21, 2025

    The idea is to integrate watsonx Orchestrate as the multi-agent orchestrator and use the IBM Cloud ML training service (e.g., Watson Machine Learning / watsonx.ai training) for scalable model training, while keeping an option to run everything locally for sensitive data—cloud training is opt-in and uses secure private connectivity.


    Integration overview
    We propose using watsonx Orchestrate to coordinate the multi-agent pipeline (Database Analysis, Data Analyst, Dataset Generator, AutoML Trainer, Analyst/Validator, Visualization). For scalable training, the AutoML Training Agent can offload training jobs to the IBM Cloud ML training service (Watson Machine Learning / watsonx.ai training) via secure APIs. For customers with strict privacy requirements, the pipeline can run entirely on-prem or in a private VPC—cloud training is optional and activated only by client consent.

    Key benefits

    Scalability & performance: leverage IBM Cloud GPUs and managed training clusters for heavy DL workloads.

    Governance & traceability: use IBM Cloud registries and model management for versioning, lineage and audit trails.

    Security & compliance: support private endpoints/VPC, encrypted transit and at-rest storage, and IAM roles for least-privilege access.

    Flexibility & privacy: default local execution for sensitive data; seamless opt-in offload to IBM Cloud when desired.

    Operational flow (summary)

    Orchestrator (watsonx Orchestrate) triggers agents and manages the pipeline.

    AutoML Training Agent packages dataset + training config and, if selected, calls the IBM Cloud training API to submit a job.

    IBM Cloud runs the job (AutoML / custom training), stores artifacts (model, metrics, logs) in the cloud registry.

    Artifacts and metrics are returned to the orchestrator for the Analyst/Validator Agent to evaluate, version, and decide on deployment or re-training.

    For privacy-sensitive customers, the same steps run locally (H2O AutoML / local training infra) with identical artifact/versioning formats.
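    The packaging step in this flow can be sketched as follows. The payload fields and the local-by-default switch are illustrative assumptions for this proposal, not the actual Watson Machine Learning API schema:

```python
import json

def build_training_job(dataset_uri: str, config: dict, run_local: bool) -> dict:
    """Package dataset + training config for submission.

    The payload shape is an illustrative placeholder, not the real
    Watson Machine Learning / watsonx.ai training API schema.
    """
    return {
        "dataset_uri": dataset_uri,
        "training_config": config,
        # Cloud offload is strictly opt-in; the default is local execution.
        "target": "local" if run_local else "ibm_cloud",
    }

job = build_training_job(
    "s3://bucket/clean.csv",
    {"engine": "h2o_automl", "max_runtime_s": 3600},
    run_local=True,  # privacy-sensitive customers never leave local infra
)
payload = json.dumps(job)
```

Because the local and cloud paths share one job format, the Analyst/Validator Agent can consume artifacts identically in both modes.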

    Security & connectivity (high level)

    Use private endpoints or VPC peering to avoid public internet exposure.

    Enforce TLS, encryption at rest (AES-256), and IAM roles for API access.

    Docker sandboxing for any generated code; default network egress disabled unless explicitly permitted.

  • Guest
    Sep 12, 2025

    images

  • Guest
    Sep 12, 2025

    This is a demo >

    Harnessing gpt-oss for superior reasoning: using gpt-oss as the brain reduced the agents' code self-correction cycle by 50% compared to other AI models, and it demonstrated significant improvements in task execution, tool use, and understanding of the agent workflow. The project itself stemmed from two core ideas: the desire to democratize deep learning and machine learning for non-experts, and the need to overcome the immense barrier of data privacy that prevents many companies from adopting AI. While powerful tools like H2O AutoML exist, they often require significant technical expertise. We saw the recent advancements in AI agents, particularly frameworks like the watsonx Orchestrate ADK, as the perfect opportunity to bridge this gap. The central idea was: "What if anyone could turn their data into valuable predictions simply by stating their goal in plain language?"

    The "eureka" moment came with the rise of powerful open-source models like gpt-oss-120b. We realized we could build a system where the advanced reasoning of a top-tier model runs entirely locally. Companies possess incredibly sensitive data they can't send to external cloud APIs due to security risks. This project was born from the vision of bringing the AutoML process to the data, not the other way around, creating a secure, self-contained, and intelligent system that makes advanced data science accessible to everyone.

    The system allows for versioning of each training run, saving the complete context of the process. This enriches future interactions, as users can refer to past training runs to refine their goals and provide more detailed instructions with less effort, creating a cycle of continuous improvement. Additionally, a SQL Agent (in beta) has been incorporated, allowing users to generate complex queries from natural-language instructions. The results of these queries can be used directly as a data source for the rest of the agents, further streamlining the analysis process.

    The system was successfully tested with real sales data from two Uruguayan online stores, wiki.com.uy and decotech.uy. Several deep learning and machine learning models were trained for different purposes, and the results were astounding: users with deep knowledge of their business but little experience in ML were able to train powerful models using different filters and features. It was like having a machine learning and deep learning analyst at the disposal of people who understand how their business works but lack technical expertise in AI. The system delivers trained models, predictions, and visualizations through a simple web dashboard.

    What it does

    This project is a fully autonomous, privacy-focused machine learning and deep learning system powered by a team of seven specialized AI agents. A user simply uploads a dataset, defines their objective in natural language (e.g., "predict customer churn based on this data"), and the system handles the entire complex ML workflow from start to finish.

    The core value proposition is 100% Data Privacy. Because the entire process—from data analysis and code generation to model training—is orchestrated by a locally-run gpt-oss model, sensitive data never leaves the user's local infrastructure. This unlocks the value of "trapped" data that was previously off-limits to cloud-based AI, enabling industries like finance, healthcare, and R&D to leverage their most valuable assets securely. The system outputs trained models, predictions, and visualizations through a simple web dashboard.

    How we built it

    We built the system using a modular, agent-based architecture:

    Local AI Brain: The intelligence of each agent is powered by a gpt-oss model (e.g., gpt-oss-120b) running on a local inference server like Ollama or vLLM.
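A minimal sketch of how one agent's call to the local "brain" might be assembled. The endpoint and request shape follow Ollama's chat API, but the model tag (`gpt-oss:120b`) and URL are examples, not the project's exact configuration; the request is built here without being sent:

```python
import json
import urllib.request

# Default Ollama endpoint on the local machine; adjust for vLLM or
# another OpenAI-compatible server.
OLLAMA_URL = "http://localhost:11434/api/chat"

def make_agent_request(system_prompt: str, user_msg: str,
                       model: str = "gpt-oss:120b") -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        "stream": False,  # ask for one complete JSON response
    }).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

req = make_agent_request("You are the DataProcessorAgent.", "Analyze ventas.csv")
```

Since the inference server runs on localhost, no prompt or data crosses the network boundary, which is the core of the privacy claim.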

    Core ML Engine: We chose H2O AutoML for its robustness and proven ability to automatically find high-performing models.

    Agent Framework: We used the watsonx Orchestrate ADK framework to create a specialized team of AI agents. Instead of one monolithic model, we designed a "team of experts," where each of the seven agents has a unique, well-defined role (e.g., DataProcessorAgent, ModelBuilderAgent).

    Secure Execution: Security was a top priority. All Python code generated by the agents is executed within an isolated Docker container (a sandbox), preventing access to the host system and ensuring dependencies are managed cleanly.
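The sandboxing described here can be approximated with standard Docker flags; the image name and resource limits below are example choices, not the project's actual configuration:

```python
def sandbox_command(script_name: str, workdir: str) -> list[str]:
    """Build a `docker run` command that executes generated code in isolation.

    The flags are standard Docker options; the image and limits are
    illustrative, not the project's real settings.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",        # egress disabled unless explicitly permitted
        "--memory", "2g", "--cpus", "2",
        "--read-only",              # immutable root filesystem
        "-v", f"{workdir}:/work",   # only the job directory is mounted
        "python:3.11-slim",
        "python", f"/work/{script_name}",
    ]

cmd = sandbox_command("train.py", "/tmp/job42")
# The command would then be executed with subprocess.run(cmd, check=True).
```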

    Web Interface and API: To make the system user-friendly, we built a web dashboard using FastAPI for the backend and simple HTML/JS for the frontend. This allows users to easily upload files, monitor training in real-time, and view results.

    Orchestration: A central "Pipeline Orchestrator" manages the entire workflow, deciding which agent to invoke at each step and passing the necessary information between them, from initial data analysis to final visualization.

    Challenges we ran into

    Agent Coordination: The biggest challenge was ensuring seamless communication between agents. Getting the DataProcessorAgent's analysis report to be perfectly understood by the ModelBuilderAgent to generate correct code required extensive trial and error and prompt refinement.

    State Management: Docker containers are stateless. Managing the project's state (file paths, model artifacts, logs) across multiple, separate Docker executions for different pipeline stages was a significant architectural hurdle.

    Handling AI Non-Determinism: LLMs can be unpredictable. An agent would sometimes generate flawed code or misinterpret a result. Building robust error-handling logic and retries, especially the feedback loop with the AnalystAgent, was crucial for making the pipeline reliable.
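The feedback loop with the AnalystAgent can be sketched as a bounded retry in which the validator's failure reason is fed back into the next generation attempt (the function names and toy agents here are hypothetical):

```python
def run_with_validation(generate, validate, max_attempts=3):
    """Retry loop: a validator reviews each attempt and its failure
    reason is fed back into the next generation attempt."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        result = generate(feedback)
        ok, feedback = validate(result)
        if ok:
            return result, attempt
    raise RuntimeError(f"validation failed after {max_attempts} attempts: {feedback}")

# Toy example: the first attempt "forgets" the date features,
# the second attempt (prompted with the feedback) includes them.
attempts = iter([["sales"], ["sales", "month", "dow"]])
generate = lambda feedback: next(attempts)
validate = lambda cols: (("month" in cols), "missing date features")

result, n_attempts = run_with_validation(generate, validate)
```

Bounding the attempts matters: without `max_attempts`, a persistently confused agent would loop forever.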

    Local Model Optimization: A key challenge was balancing the great potential of gpt-oss with the performance limitations of running it locally. We evaluated several models; some, like gemma3 27B, qwen3 32B, and deepseek-v2 16B, often fell short of gpt-oss-120B, requiring more agent calls for corrections, losing focus, or even getting stuck for hours. On the other hand, much larger models such as deepseek-v2 236B and qwen3 235B were simply too big, consuming excessive resources and taking longer to complete tasks in a local environment, which made finding the right model a critical challenge.

    Dynamic Adjustment: It was challenging to make the agents smart enough to recognize when a trained model's metrics were poor. We had to build logic that allowed them to detect this situation and decide whether to improve, adjust, or change the data and parameters to achieve better performance.

    Accomplishments that we're proud of

    Real-World Validation and Empowerment: We successfully tested the system with data from the online stores wiki.com.uy and decotech.uy. We demonstrated that users with business knowledge but no ML experience could, on their own, train models and achieve astounding results. We managed to put a virtual "machine learning data analyst" at their disposal.

    Achieving True Data Privacy: We successfully created a powerful AutoML system where sensitive data never leaves the user's machine, addressing a major blocker for AI adoption in regulated industries.

    Harnessing gpt-oss for Superior Reasoning: We are incredibly proud that using gpt-oss as the brain reduced the agents’ code self-correction cycle by 50% compared to other AI models. It demonstrated significant improvements in task execution, tool use, and understanding the agent workflow.

    Building a Self-Healing System: The implementation of the AnalystAgent acts as a quality control specialist. It reviews the code and results from other agents and can send tasks back for correction, creating a robust, self-healing workflow.

    The Power of Specialization: We proved that dividing a complex problem like an AutoML pipeline into smaller tasks for specialized agents is far more effective than a monolithic approach. This strategy gave the gpt-oss model reasoning "superpowers" for specific tasks.

    What we learned

    gpt-oss Excels in Reasoning: This project was a deep dive into practical multi-agent systems. The most critical lesson was that gpt-oss excels at reasoning tasks. It significantly sped up development and improved reliability, completing data processing, input handling, and visualization tasks more quickly and with fewer errors.

    Prompt Engineering is Everything: The quality of the system's output is directly tied to the quality of the prompts given to each agent. We spent considerable time refining prompts to ensure each agent understood its role, limitations, and the exact format of its expected output.

    Self-Correction Loops are a Must: We learned the importance of having a "validator" agent. The AnalystAgent acts as a quality assurance layer, creating a robust system that can catch and fix its own mistakes.

    Specialization Unlocks Potential: Breaking a complex problem down for specialized agents is a highly effective design pattern. It allows the core LLM to focus its reasoning power on well-defined tasks, leading to better and more reliable results.

    What's next for Multi-Agent Auto Machine Learning - Deep Learning System

    Expanded ML Capabilities: We plan to extend the system's capabilities beyond tabular data to include time-series forecasting, NLP tasks, and eventually computer vision.

    Enhanced User Interaction: We want to build a more interactive UI where users can collaborate with the agents, tweak parameters during the process, and perform more in-depth model comparisons.

    Custom Agent Workflows: Allow users to define or customize their own agent workflows, adding or removing steps to tailor the pipeline to specific, unique business problems.

    Advanced Feature Engineering: Empower the agents with more sophisticated tools for automatic feature engineering and data enrichment to further improve model performance.

    Summary of an execution where the objective was to predict 30-day sales

    The goal was to predict the total sales amount for the next 30 days from a CSV file. The process was fully automated by a team of artificial intelligence agents.

    Main Phases

    Data Analysis (Start: 3:45 PM)

    The system analyzed the ventas.csv file and detected that the data had a particular format:

    • Semicolon (;) as a separator.
       
    • Comma (,) as a decimal separator.
       
    • Dates in Spanish format (e.g., "15 Nov. 2023").
       
    • No header row; the first line already contained sales data.
       
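Handling the detected format (semicolon separators, comma decimals, Spanish month names, no header row) might look like the following sketch; `parse_row` and the `MESES` table are illustrative, not the project's generated code:

```python
import csv
import io
from datetime import date

# Spanish month abbreviations as they appear in dates like "15 Nov. 2023".
MESES = {"ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
         "jul": 7, "ago": 8, "sep": 9, "oct": 10, "nov": 11, "dic": 12}

def parse_row(row):
    dia, mes, anio = row[0].split()                       # "15 Nov. 2023"
    fecha = date(int(anio), MESES[mes.rstrip(".").lower()[:3]], int(dia))
    monto = float(row[1].replace(",", "."))               # comma decimal separator
    return fecha, monto

raw = "15 Nov. 2023;1234,50\n16 Nov. 2023;980,00\n"       # no header row
rows = [parse_row(r) for r in csv.reader(io.StringIO(raw), delimiter=";")]
```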

    First Training Attempt (Failed)

    • A script was generated to train a Machine Learning or Deep Learning model.
       
    • Result: The training failed.
       
    • Error Cause: The AnalystAgent determined that although features based on date (such as year, month, day of the week) were created, they were mistakenly excluded from the training process.
       
    • This left the model without predictive data to learn from, causing the failure.
       
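The fix amounts to deriving calendar features from each date and guaranteeing they actually reach the predictor list; a hypothetical sketch of that guard:

```python
from datetime import date

def date_features(d: date) -> dict:
    # Calendar features that must be fed to the model, not dropped.
    return {"year": d.year, "month": d.month, "dow": d.weekday()}

def build_training_frame(rows):
    frame = []
    for d, amount in rows:
        feats = date_features(d)
        feats["target"] = amount
        frame.append(feats)
    # Every non-target column is a predictor; an empty predictor list
    # would reproduce the failure described above.
    predictors = [c for c in frame[0] if c != "target"]
    assert predictors, "no predictive features - model cannot learn"
    return frame, predictors

frame, predictors = build_training_frame([(date(2023, 11, 15), 1234.5)])
```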

    Correction and Second Training (Successful)

    Based on the error analysis, the ModelBuilderAgent corrected the script to:

    • Properly include all generated time features.
       
    • Improve handling of null values (NaN) in the time series features.
       

    Result:
    The second attempt was successful. A Gradient Boosting Machine (GBM) model was trained with acceptable performance (R² = 0.63, meaning it explains approximately 63% of sales variability).
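For reference, R² compares residual error against the variance of the data: a model that merely predicts the mean scores 0 and a perfect model scores 1, so 0.63 means the model removes 63% of that baseline variance. A minimal computation:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# Perfect predictions give R² = 1; always predicting the mean gives R² = 0.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```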

    Prediction Generation

    Using the trained model, the PredictionAgent generated a sales forecast for the next 30 days and saved it in predictions.csv.

    Results Visualization

    The VisualizationAgent combined the historical data and the new predictions to create a chart.

    • There was a small issue where the chart wasn’t saved on the first attempt.
       
    • The system detected it and successfully regenerated the visualization, saving it as forecast_plot.png.
       

    Final Conclusion

    The pipeline successfully completed its objective. Despite an initial training failure, the system was able to diagnose the issue, correct it automatically, and complete all phases of the process.

    Final Outcome

    • Trained Model: A GBM model ready to predict sales.
       
    • Predictions File: predictions.csv
       
    • Forecast Chart: forecast_plot.png (shows historical trends and the future forecast).
       

    The entire process took approximately 30 minutes.