发现真理-FRSMASH 全维度消融实验报告(完整版)
一、实验设计核心假设:记忆与逻辑是两个独立可调的维度记忆能力 ← d_model(状态向量宽度) SlowMemory(内容门控) 逻辑能力 ← OpenASH 层数(L) cummax/gen_model Fast 层 ← 替代 OpenASH 的轻量线性递推(B88 时的替代选择)实验矩阵(8组 × 33项)组固定变量步数验证问题A. 逻辑轴H512L2,4,6,83000层数↑→loss↓多少?B. 记忆轴L4H256,384,512,6403000宽度↑→loss↓多少?C. 组件消融H512完整/去ASH/去Slow3000哪个组件贡献大?D. 快慢比d512(HybridFRSM)3F1S/2F2S/1F1S/0F1S3000最优F/S比?E. 混合比H512 L44F/3F1A/2F2A/1F3A/0F4A3000Fast替代OpenASH掉多少?F. K值H512 L4K1,2,4,8,16,∞1500慢记忆更新频率最优值? (含CopyFirst)G. NS纯慢(0F)NS1,2,41500慢尺度数量帮助多大? (含CopyFirst)H. 记忆消融H512 L4 K8完整/去Slow/去ASH1500CopyFirst 位置PPL对比实验条件: RTX 4090 D, pretrain_t2t_mini 30000行, T384, B64, AdamW lr5e-4二、全部实验结果A. 逻辑轴 (H512, NS1, K8)L参数lossΔ/L耗时229.6M3.504—917s433.3M3.221-0.282177s636.9M3.061-0.161685s840.6M2.922-0.143204sB. 记忆轴 (L4, NS1, K8)H参数lossΔ/H耗时25614.2M3.778—383s38423.1M3.474-0.30442s51233.3M3.229-0.25558s64044.6M3.014-0.22660sC. 组件消融 (H512, L4)配置loss耗时解读完整3.213557s基准去Slow(纯OpenASH)3.223382s0.01,OpenASH贡献0.35去OpenASH(纯Slow)3.5604716s0.35,Slow贡献仅0.01D. 快慢比 (HybridFRSM, d512)配置loss耗时3F1S4.7191622s2F2S4.6912153s1F1S4.7192237s0F1S(≈V6)5.4012046sHybridFRSM 整体不如 FRSMASH(4.7 vs 3.2),纯慢(0F1S)最差。E. 混合比 (FRSMASH Hybrid, H512, L4, B64)配置loss耗时vs 0F4A4F0A3.6251692s0.413F1A3.4621872s0.252F2A3.2931746s0.081F3A3.3071434s0.090F4A3.215566s基准B64时纯OpenASH双最优。2F2A仅差0.08,大batch时可替代纯OpenASH。F. 慢记忆 K 值 (H512, L4, 含CopyFirst)Klossmem_score耗时13.78-6.205942s23.16-4.593454s43.75-5.501785s83.76-5.07980s163.77-5.00552s∞(不更新)3.76-3.81206sK2:loss最低(3.16)且记忆最强(-4.59)。K1反而差:更新太频繁,噪声淹没有用信息。G. 慢尺度数量 (纯慢,0F, K8)NSlossmem_score耗时15.64-5.77353s25.64-5.361128s45.64-5.99708s没有OpenASH,加多少慢尺度都救不了 loss(5.64)。记忆和逻辑必须配合。H. 记忆任务消融 (H512, L4, 含CopyFirst 位置PPL)CopyFirst 对比:配置lossmem_score耗时完整(K8)3.77-4.74499s去Slow(K∞)3.78-5.10211s去OpenASH(纯Slow)3.95-5.848760sloss差0.01,但mem_score差0.36 —LM loss测不出记忆,但CopyFirst测得出。位置分段 PPL(200步训练,对比 K2 vs K∞):模型near(0-64)mid(128-192)far(320-384)far/nearstabilityK2(完整)109.6126.4174.01.59x0.479NoSlow(K∞)112.9123.3147.01.30x0.731K2近端更强(109 vs 113),但远端退化更多(1.59x vs 1.30x)。训练速度全对比模型参数B64 tok/sB88 tok/s推理 tok/sFRSMASH H512 L433M62K8.3K(OOM边缘)247FRSMASH-F 4F0A33M63K52.7K324HybridFRSM d1024 3F1S100M—40K152Dense MoE C20100M—27K—原 Sparse MoE102M—5K—三、核心发现1. 记忆-逻辑解耦被实验证实A组(L2→L8): 逻辑深度↑ → loss 3.50→2.92 (-0.58) B组(H256→H640): 记忆宽度↑ → loss 3.78→3.01 (-0.76) 两条独立轴,各自贡献,可独立调参2. OpenASH backbone 是 FRSMASH 的灵魂贡献 0.35 loss(C组)cummax 5-branch gen_model 提供极强的 LM 先验Slow memory 在 74M token 时对 LM loss 仅贡献 0.013. Slow memory 在工作中——LM loss 看不出来CopyFirst: 完整 去Slow(mem_score -4.74 vs -5.10,差距 0.36)LM loss: 完整 ≈ 去Slow(3.77 vs 3.78,差距 0.01)结论:LM loss 不是记忆的度量,CopyFirst才是4. K2 是魔术数字,但有代价优势:loss 最低(3.16 vs 3.76)、CopyFirst 最强(-4.59 vs -3.81)代价:远端PPL退化更多(1.59x vs 1.30x),192次更新累积门控噪声K8 是工程平衡:loss 接近 K2,速度快 3.5x,远端更稳定5. B64 时纯 OpenASH 最优; B88 时 2F2A 是唯一解B64: 0F4A loss3.22, 566sB88: 0F4A OOM →2F2A loss3.29(0.07),可正常训练四、最佳架构推荐场景配置loss(预测)参数量说明 综合最优FRSMASH H512 L8 K82.9240.6M最强逻辑,8层OpenASH性价比最优FRSMASH H512 L4 K83.2233.3M训练时间减半大batch训练FRSMASH Hybrid 2F2A3.2933.3MB88能跑,OpenASH已OOM记忆最优FRSMASH H512 L4 K23.1633.3MCopyFirst最强,loss最低推理部署FRSMASH-F H512 L4—33.3M推理324 tok/s,最快省参数FRSMASH H256 L43.7814.2M14M参数,快速验证结论:FRSMASH H512 L8 K8 是当前全维度最优架构。五、结论FRSMASH 的记忆-逻辑解耦理论被实验证实(A/B组独立验证)OpenASH cummax 是核心引擎(贡献 0.35 loss),Slow memory 在 LM loss 上贡献微小但在记忆中不可替代K8 是工程最优(loss/速度/稳定性的平衡),K2 适合追求极致记忆和 loss 的场景Fast 层的价值在大 batch(B88),B64 时纯 OpenASH 最优LM loss 测不出记忆能力—必须用 CopyFirst 或位置 PPL 等记忆专项任务附录: FRSMASH 完整模型代码文件:frsmash.py FRSMASH — OpenASH 骨干 1 慢尺度记忆 架构: 1. 共享 embedding 2. OpenASH 多层骨干 (cummax gen_model FFN) → 强 LM 特征 3. 慢尺度记忆 (内容门控, 每 K 步更新) → 选择性长期记忆 4. 门控融合: per-token 决定依赖 LM 还是记忆 importtorchimporttorch.nnasnnimporttorch.nn.functionalasF# # 1. OpenASH 组件# classMaxStateSuper(nn.Module):OpenASH 核心: 多头 cummax gen_modeldef__init__(self,dim_size,heads,model_flagtrain):super().__init__()self.headsheads self.d_headdim_size//heads self.model_flagmodel_flag self.combinednn.Linear(dim_size,4*dim_size,biasFalse)self.alpha1nn.Parameter(torch.tensor(0.5))self.alpha2nn.Parameter(torch.tensor(0.5))self.alpha3nn.Parameter(torch.tensor(0.5))self.head_linearnn.Linear(heads*5,heads,biasFalse)defforward(self,x,stateNone):b,s,dx.shape combinedself.combined(x).view(b,s,4,self.heads,-1)out,out1,out2,out3combined.unbind(2)outout.permute(0,3,1,2)out1out1.permute(0,3,1,2)out2out2.permute(0,3,1,2)out3out3.permute(0,3,1,2)ifstateisNone:out4,_torch.cummax(out2,dim2)stateout4[:,:,-1:]else:out4,_torch.cummax(torch.cat([state,out2],dim2),dim2)ifself.model_flagtrain:out4out4[:,:,1:]else:out4out4[:,:,-1:]stateout4[:,:,-1:]cattorch.cat([out,out1,out2,out3,out4],dim-1)combined_gself.head_linear(cat)*out4 term1out*out1 term2self.alpha1*out1self.alpha2*out3 term3out*(self.alpha3*out4out3)term4out1*(out2out4)resultterm1term2term3term4out2*out4combined_g out_lresult.transpose(1,2).contiguous().view(b,s,d)returnout_l,stateclassFeedForward(nn.Module):def__init__(self,hidden_size):super().__init__()self.ffn1nn.Linear(hidden_size,hidden_size)self.ffn2nn.Linear(hidden_size,hidden_size)self.gatenn.Linear(hidden_size,hidden_size)self.relunn.ReLU()defforward(self,x):returnself.ffn2(self.ffn1(x)*self.relu(self.gate(x)))classASHDecoderLayer(nn.Module):def__init__(self,hidden_size,num_heads,model_flagtrain):super().__init__()self.attnMaxStateSuper(hidden_size,num_heads,model_flag)self.ffnFeedForward(hidden_size)self.normnn.LayerNorm(hidden_size)self.alphann.Parameter(torch.tensor(0.5))defforward(self,x,stateNone):x1,stateself.attn(x,state)xself.norm(self.alpha*self.ffn(x1)(1-self.alpha)*x)returnx,state# # 2. 慢尺度记忆# classSlowMemoryCell(nn.Module): 内容门控慢记忆 — 选择性写入 h_new α * candidate (1-α) * h_prev α sigmoid(MLP([h_prev; inp])) ← 内容决定写入强度 def__init__(self,d_model):super().__init__()dd_model self.W_forgetnn.Linear(d*2,d)self.W_inputnn.Linear(d*2,d)self.W_candnn.Linear(d*2,d)nn.init.constant_(self.W_forget.bias,1.0)nn.init.constant_(self.W_input.bias,-2.0)dhmax(d//4,1)self.gatenn.Sequential(nn.Linear(d*2,dh),nn.GELU(),nn.Linear(dh,1),nn.Sigmoid())defforward(self,x_t,h_prev):ctorch.cat([h_prev,x_t],dim-1)ftorch.sigmoid(self.W_forget(c))itorch.sigmoid(self.W_input(c))candf*h_previ*torch.tanh(self.W_cand(c))alphaself.gate(c).squeeze(-1).unsqueeze(-1)returnalpha*cand(1-alpha)*h_prev# # 3. FRSMASH — 融合模型# classFRSMASH(nn.Module): FRSMASH OpenASH backbone 1 SlowMemory 参数: voc_size: 词表大小 hidden_size: 隐藏维度 num_heads: 注意力头数 num_layers: OpenASH 层数 K: 慢尺度更新周期 (默认 8) def__init__(self,voc_size,hidden_size,num_heads,num_layers,K8):super().__init__()self.Dhidden_size self.KK self.emnn.Embedding(voc_size,hidden_size,padding_idx0)self.ash_layersnn.ModuleList([ASHDecoderLayer(hidden_size,num_heads,train)for_inrange(num_layers)])self.ash_normnn.LayerNorm(hidden_size)self.mem_input_projnn.Linear(hidden_size,hidden_size)self.slow_cellSlowMemoryCell(hidden_size)self.mem_projnn.Linear(hidden_size,hidden_size)self.fusion_gatenn.Sequential(nn.Linear(hidden_size*2,hidden_size//4),nn.GELU(),nn.Linear(hidden_size//4,1),nn.Sigmoid())self.fusion_normnn.LayerNorm(hidden_size)self.headnn.Linear(hidden_size,voc_size,biasFalse)defforward(self,x):B,Tx.shape;Dself.D x_embself.em(x)hx_embforlayerinself.ash_layers:h1,_layer(h)hh1h x_ashself.ash_norm(h)inp_seqself.mem_input_proj(x_emb)h_slowtorch.zeros(B,D,devicex.device)H_slowtorch.zeros(B,T,D,devicex.device)prev0fortinrange(0,T,self.K):h_slowself.slow_cell(inp_seq[:,t],h_slow)H_slow[:,prev:t1]h_slow.unsqueeze(1)prevt1ifprevT:H_slow[:,prev:]h_slow.unsqueeze(1)x_memself.mem_proj(H_slow)cattorch.cat([x_ash,x_mem],dim-1)gateself.fusion_gate(cat)fusedself.fusion_norm(gate*x_ash(1-gate)*x_memx_emb)returnself.head(fused)torch.no_grad()defgenerate_step(self,token_id,ash_states,h_slow):Btoken_id.size(0)xself.em(token_id)hx new_states[]fori,layerinenumerate(self.ash_layers):layer.attn.model_flaginferh1,slayer.attn(h,ash_states[i])h1layer.norm(layer.alpha*layer.ffn(h1)(1-layer.alpha)*h)hh1h new_states.append(s)x_ashself.ash_norm(h[:,0])inpself.mem_input_proj(x[:,0])h_slow_newself.slow_cell(inp,h_slow)x_memself.mem_proj(h_slow_new)cattorch.cat([x_ash,x_mem],dim-1)gateself.fusion_gate(cat)fusedself.fusion_norm(gate*x_ash(1-gate)*x_memx[:,0])logitsself.head(fused)returnlogits,new_states,h_slow_new模型文件索引文件内容frsmash.pyFRSMASH 完整模型(OpenASH Slow,最佳架构)frsmash_f.pyFRSMASH-F(Fast 线性层替代 cummax)frsm_linear.pyHybridFRSM(快慢尺度分离)frsm_v6a_fast.py原始 V6(4尺度内容门控,串行)frsm_v6_moe/frsm_v6a_dense_moe.pyDense MoE(16专家共享)frsm_v6_moe/train_dense_moe.pyDense MoE 训练脚本frsm_v6_moe/ablation.py消融实验 A-Efrsm_v6_moe/ablation_memory.py消融实验 F-H(记忆专项)