Evaluating and Testing AI Agents: How to Quantify Agent Performance

AI agent capabilities are evolving rapidly, from simple question-answering assistants to intelligent systems that can plan autonomously, call tools, and complete complex tasks. But how do you rigorously evaluate an agent's capability level? How do you make sure it performs reliably in real deployments? This article systematically introduces the evaluation dimensions for AI agents, automated testing frameworks, and mainstream benchmarking methods, to help developers build a complete agent evaluation system.

1. Why Agent Evaluation Is So Complex

Unlike traditional software or a single model, AI agents have the following characteristics, which make evaluation challenging:

- Multi-turn interaction: the agent must keep its state consistent across multiple dialogue turns.
- Tool calling: the agent works with heterogeneous tools such as external APIs, databases, and code execution.
- Autonomous planning: execution paths are non-deterministic, and the same task may have several correct solutions.
- Long-range dependencies: task steps form causal chains, so early errors can cascade and amplify.
- Environment interaction: the agent must interact dynamically with real or simulated environments.

Agent evaluation therefore cannot simply reuse LLM metrics such as perplexity or BLEU; it needs a more comprehensive evaluation framework.

2. Core Evaluation Dimensions

2.1 Task Success Rate

Task success rate is the most intuitive evaluation metric. It measures how often the agent succeeds on a given set of tasks.

from dataclasses import dataclass
from typing import List, Optional, Any


@dataclass
class TaskResult:
    """Result of a single task execution."""
    task_id: str
    success: bool            # whether the task was completed successfully
    completion_rate: float   # degree of completion (0.0-1.0)
    steps_taken: int         # number of steps executed
    max_steps: int           # maximum number of steps allowed
    time_elapsed: float      # elapsed time (seconds)
    final_answer: str        # final output
    gold_answer: str         # reference answer


class TaskSuccessEvaluator:
    """Evaluator for task success rate."""

    def __init__(self, tolerance: float = 0.05):
        self.tolerance = tolerance  # relative tolerance for numeric comparison

    def exact_match(self, predicted: str, expected: str) -> bool:
        """Exact match after stripping surrounding whitespace."""
        return predicted.strip() == expected.strip()

    def contains_match(self, predicted: str, expected: str) -> bool:
        """Containment match: the prediction only needs to contain the correct answer."""
        return expected.strip().lower() in predicted.strip().lower()

    def numeric_match(self, predicted: str, expected: str) -> bool:
        """Numeric match: compares values within a relative tolerance."""
        try:
            # Remove thousands separators before parsing
            p_val = float(predicted.replace(",", ""))
            e_val = float(expected.replace(",", ""))
            return abs(p_val - e_val) / max(abs(e_val), 1e-10) <= self.tolerance
        except ValueError:
            return False

    def evaluate(self, results: List[TaskResult]) -> dict:
        """Evaluate a batch of task results."""
        total = len(results)
        if total == 0:
            # Avoid division by zero on an empty batch
            raise ValueError("results must contain at least one TaskResult")

        success_count = sum(1 for r in results if r.success)
        avg_completion = sum(r.completion_rate for r in results) / total
        avg_steps = sum(r.steps_taken for r in results) / total

        # Efficiency metric: succeeded within 80% of the allowed step budget
        efficient_count = sum(
            1 for r in results
            if r.success and r.steps_taken <= r.max_steps * 0.8
        )

        return {
            "success_rate": success_count / total,
            "avg_completion_rate": avg_completion,
            "avg_steps": avg_steps,
            "efficiency_rate": efficient_count / total,
            "total_tasks": total,
        }


# Usage example
results = [