arXiv submission date: 2026-03-09
📄 Abstract - LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization

Databricks job orchestration systems (e.g., LeJOT) reduce cloud costs by selecting low-priced compute configurations while meeting latency and dependency constraints. Accurate execution-time prediction under heterogeneous instance types and non-stationary runtime conditions is therefore critical. Existing pipelines rely on static, manually engineered features that under-capture runtime effects (e.g., partition pruning, data skew, and shuffle amplification), and predictive signals are scattered across logs, metadata, and job scripts, which lengthens update cycles and increases engineering overhead. We present LeJOT-AutoML, an agent-driven AutoML framework that embeds large language model agents throughout the ML lifecycle. LeJOT-AutoML combines retrieval-augmented generation over a domain knowledge base with a Model Context Protocol toolchain (log parsers, metadata queries, and a read-only SQL sandbox) to analyze job artifacts, synthesize and validate feature-extraction code via safety gates, and train/select predictors. This design materializes runtime-derived features that are difficult to obtain through static analysis alone. On enterprise Databricks workloads, LeJOT-AutoML generates over 200 features and reduces the feature-engineering and evaluation loop from weeks to 20-30 minutes, while maintaining competitive prediction accuracy. Integrated into the LeJOT pipeline, it enables automated continuous model updates and achieves 19.01% cost savings in our deployment setting through improved orchestration.
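
The abstract mentions validating synthesized feature-extraction code "via safety gates" but does not describe how those gates work. As a rough illustration only, the sketch below shows one way such a gate could be built: a static check over the generated code's syntax tree that rejects imports and calls outside an allow-list. The function name `safety_gate`, the allow-list, and the sample snippets are all assumptions made for this example and are not taken from the paper.

```python
import ast

# Hypothetical policy for illustration; the paper does not specify its rules.
ALLOWED_IMPORTS = {"math", "re", "statistics"}
FORBIDDEN_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def safety_gate(source: str) -> list[str]:
    """Return the policy violations found in generated code (empty list = pass)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b" is checked against its top-level package name.
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have module=None and are rejected as well.
            module = (node.module or "").split(".")[0]
            if module not in ALLOWED_IMPORTS:
                violations.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only direct calls by name are caught; this is not a full sandbox.
            if node.func.id in FORBIDDEN_CALLS:
                violations.append(f"call not allowed: {node.func.id}")
    return violations


# Two made-up snippets standing in for LLM-generated feature extractors.
good_snippet = (
    "import math\n"
    "\n"
    "def log_input_size(n_bytes):\n"
    "    return math.log1p(n_bytes)\n"
)
bad_snippet = (
    "import os\n"
    "\n"
    "def read_file(path):\n"
    "    return open(path).read()\n"
)

if __name__ == "__main__":
    print("good:", safety_gate(good_snippet))
    print("bad:", safety_gate(bad_snippet))
```

A static check like this is cheap to run before any generated code executes, but it only catches the patterns it lists; a production gate would presumably pair it with runtime isolation such as the read-only SQL sandbox the abstract describes.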

Top-level tags: llm agents systems
Detailed tags: automated machine learning, feature engineering, execution time prediction, cost optimization, retrieval-augmented generation

LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization


1️⃣ One-sentence summary

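This paper proposes LeJOT-AutoML, an agent-driven framework that uses large language models to automatically analyze job logs and scripts and quickly generate the key features needed to predict execution time, shortening the feature-engineering cycle from weeks to about half an hour and helping save roughly 19% of cloud computing costs in a real deployment.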

Source: arXiv: 2603.07897