
arXiv submission date: 2026-03-09
📄 Abstract - Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.
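To make the reported figures concrete, the sketch below normalizes per-dimension scores onto a common 0–1 scale and averages them. The dimension names and maximum scores (factual accuracy out of 10, completeness out of 70) come from the abstract; the aggregation rule itself, and the function name `aggregate_score`, are illustrative assumptions, not the paper's published methodology.

```python
# Hypothetical sketch of multi-dimensional score aggregation.
# Assumption: each dimension is rescaled to [0, 1] by dividing by its
# maximum, then the dimensions are averaged with equal weight. The
# actual AFIB weighting scheme is not specified in the abstract.

def aggregate_score(scores: dict[str, float],
                    max_scores: dict[str, float]) -> float:
    """Equal-weight average of min-maxed dimension scores, in [0, 1]."""
    normalized = [scores[d] / max_scores[d] for d in scores]
    return sum(normalized) / len(normalized)

# Example with the two figures the abstract reports for SuperInvesting:
# factual accuracy 8.96/10 and completeness 56.65/70.
max_scores = {"factual_accuracy": 10.0, "completeness": 70.0}
scores = {"factual_accuracy": 8.96, "completeness": 56.65}
print(round(aggregate_score(scores, max_scores), 3))  # → 0.853
```

An equal-weight average is the simplest choice; a real benchmark might weight dimensions by task importance or report them separately, as the abstract's per-dimension figures suggest the authors do.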

Top-level tags: llm, financial, benchmark
Detailed tags: financial intelligence, evaluation framework, model comparison, hallucination rate, analytical reasoning

Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines


1️⃣ One-sentence summary

This paper proposes a multi-dimensional evaluation framework for financial intelligence and finds that, on complex investment research tasks, AI systems that combine structured financial data access with analytical reasoning capabilities (such as SuperInvesting) deliver the most reliable performance.

Source: arXiv:2603.08704