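Preface

In 2022 I wrote a walkthrough of the PowerTCP paper. The paper models several congestion control algorithms, from FAST TCP to PowerTCP. PowerTCP's simulation code builds on the HPCC simulation codebase; the project is named ns3-datacenter.

Basic concepts

The buffer of a PFC-capable switch usually follows a layered, shared model, divided into three logical regions:

- Guaranteed buffer: a dedicated buffer for each queue, so that every queue has a certain amount of space to ensure basic forwarding.
- Shared buffer: the buffer that can be requested when traffic bursts occur; it is shared by all queues.
- Headroom: the buffer that can still be used after the PFC watermark is triggered and before the upstream sender actually slows down.

Guaranteed buffer (dedicated buffer): the minimum buffer space owned exclusively by each queue. It guarantees every queue, lossless or lossy, a baseline forwarding capability; this space cannot be used by other queues even when it is idle. In HPCC's SwitchMmu class it corresponds to reserve (4 KB), the fixed space that each port/queue uses first.

Shared buffer (service pool): a common pool shared by all queues. Once a queue has used up its dedicated buffer, it can draw on the shared buffer to absorb normal traffic bursts. This is the region managed by shared_used_bytes and GetPfcThreshold. Packets that exceed reserve are placed in the shared region, subject to the threshold.

Headroom buffer: when shared usage reaches the threshold (the Xoff threshold), newly arriving packets are stored in headroom. It exists to absorb the packets that are still in flight between the moment the switch sends a PFC pause frame and the moment the upstream device actually stops sending, so that no packet is dropped.

    void SwitchMmu::UpdateIngressAdmission(uint32_t port, uint32_t qIndex, uint32_t psize){
        uint32_t new_bytes = ingress_bytes[port][qIndex] + psize;
        if (new_bytes <= reserve){
            ingress_bytes[port][qIndex] += psize;
        }else{
            uint32_t thresh = GetPfcThreshold(port);
            if (new_bytes - reserve > thresh){
                hdrm_bytes[port][qIndex] += psize;
            }else{
                ingress_bytes[port][qIndex] += psize;
                shared_used_bytes += std::min(psize, new_bytes - reserve);
            }
        }
    }

The memory management policy implemented in ns3-datacenter's switch-mmu.cc is more complex. Its comments include an explanation of the MMU, and the simulation represents buffer occupancy as counters:

"The switch has an on-chip buffer which has bufferPool size. This buffer is shared across all port and queues in the switch. bufferPool is further split into multiple pools at the ingress and egress."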
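The three regions can be pictured as nothing more than counters. The sketch below is a toy, single-queue model written for illustration; it is not the HPCC SwitchMmu class. ToyQueueBuffer, Admit, and the fixed pfc_threshold are assumptions introduced here (in HPCC the threshold comes from GetPfcThreshold(port) and changes with shared usage), and dequeueing is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy single-queue model of the three buffer regions (hypothetical, for illustration).
struct ToyQueueBuffer {
    uint32_t reserve;                // guaranteed bytes for this queue
    uint32_t pfc_threshold;          // assumed fixed; stands in for GetPfcThreshold(port)
    uint32_t ingress_bytes = 0;      // bytes charged to reserve + shared
    uint32_t hdrm_bytes = 0;         // bytes charged to headroom
    uint32_t shared_used_bytes = 0;  // bytes held in the shared pool

    ToyQueueBuffer(uint32_t r, uint32_t t) : reserve(r), pfc_threshold(t) {}

    // Charge one arriving packet to a region, mirroring the branch order of
    // UpdateIngressAdmission: reserve first, then shared, then headroom.
    void Admit(uint32_t psize) {
        uint32_t new_bytes = ingress_bytes + psize;
        if (new_bytes <= reserve) {
            ingress_bytes += psize;                      // still inside the guaranteed buffer
        } else if (new_bytes - reserve > pfc_threshold) {
            hdrm_bytes += psize;                         // over the threshold: spill to headroom
        } else {
            ingress_bytes += psize;
            // Only the part above reserve counts against the shared pool.
            shared_used_bytes += std::min(psize, new_bytes - reserve);
            // Without dequeueing, shared usage cannot pass the threshold in this toy.
            assert(shared_used_bytes <= pfc_threshold);
        }
    }
};
```

With reserve = 4096 and a threshold of 2000, admitting seven 1000-byte packets leaves 6000 bytes in ingress_bytes, 1904 of them in the shared pool, and the seventh packet (1000 bytes) in headroom.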
"It would be easier to understand from here on if you consider Ingress/Egress are merely just counters."

Triggering PFC

When QbbNetDevice::Receive receives a packet, it hands the packet up to SwitchNode. The main purpose is to update the MMU counters and perform queue management.

SwitchNode::SendToDev

    // This function can only be called in switch mode
    bool SwitchNode::SwitchReceiveFromDevice(Ptr<NetDevice> device, Ptr<Packet> packet, CustomHeader &ch){
        SendToDev(packet, ch);
        return true;
    }

    void SwitchNode::SendToDev(Ptr<Packet> p, CustomHeader &ch){
        int idx = GetOutDev(p, ch);
        if (idx >= 0){
            NS_ASSERT_MSG(m_devices[idx]->IsLinkUp(), "The routing table look up should return link that is up");

            // determine the qIndex
            uint32_t qIndex;
            if (ch.l3Prot == 0xFF || ch.l3Prot == 0xFE || (m_ackHighPrio && (ch.l3Prot == 0xFD || ch.l3Prot == 0xFC))){ //QCN or PFC or NACK, go highest priority
                qIndex = 0;
            }else{
                qIndex = (ch.l3Prot == 0x06 ? 1 : ch.udp.pg); // if TCP, put to queue 1
            }

            // admission control
            FlowIdTag t;
            p->PeekPacketTag(t);
            uint32_t inDev = t.GetFlowId();
            if (qIndex != 0){ //not highest priority
                if (m_mmu->CheckIngressAdmission(inDev, qIndex, p->GetSize()) && m_mmu->CheckEgressAdmission(idx, qIndex, p->GetSize())){ // Admission control
                    m_mmu->UpdateIngressAdmission(inDev, qIndex, p->GetSize());
                    m_mmu->UpdateEgressAdmission(idx, qIndex, p->GetSize());
                }else{
                    return; // Drop
                }
                CheckAndSendPfc(inDev, qIndex);
            }
            m_bytes[inDev][idx][qIndex] += p->GetSize();
            m_devices[idx]->SwitchSend(qIndex, p, ch);
        }else
            return; // Drop
    }

GetOutDev looks up the routing table m_rtTable.

SwitchNode::CheckAndSendPfc

    void SwitchNode::CheckAndSendPfc(uint32_t inDev, uint32_t qIndex){
        Ptr<QbbNetDevice> device = DynamicCast<QbbNetDevice>(m_devices[inDev]);
        if (m_mmu->CheckShouldPause(inDev, qIndex)){
            device->SendPfc(qIndex, 0);
            m_mmu->SetPause(inDev, qIndex);
        }
    }

    bool SwitchMmu::CheckShouldPause(uint32_t port, uint32_t qIndex){
        return !paused[port][qIndex] && (hdrm_bytes[port][qIndex] > 0 || GetSharedUsed(port, qIndex) >= GetPfcThreshold(port));
    }

A PFC pause frame is sent only when both of the following hold:

- The queue is not already paused (!paused[port][qIndex]). This prevents sending duplicate PFC frames and avoids redundant notifications.
- A warning condition is met. Either one of the following is enough:
    - Condition A: headroom is already in use (hdrm_bytes[port][qIndex] > 0). Packets for this queue have spilled into the headroom reserved for PFC; the situation is urgent and the peer must be paused immediately.
    - Condition B: shared buffer usage has reached the dynamic threshold (GetSharedUsed(port, qIndex) >= GetPfcThreshold(port)). The queue's occupancy in the shared buffer has hit the warning watermark, so PFC is sent early to keep subsequent packets from flooding into headroom.

PFC is sent upstream, that is, in the direction the packet came from (inDev).

QbbNetDevice::SendPfc

    void QbbNetDevice::SendPfc(uint32_t qIndex, uint32_t type){
        Ptr<Packet> p = Create<Packet>(0);
        PauseHeader pauseh((type == 0 ? m_pausetime : 0), m_queue->GetNBytes(qIndex), qIndex);
        p->AddHeader(pauseh);
        Ipv4Header ipv4h; // Prepare IPv4 header
        ipv4h.SetProtocol(0xFE);
        ipv4h.SetSource(m_node->GetObject<Ipv4>()->GetAddress(m_ifIndex, 0).GetLocal());
        ipv4h.SetDestination(Ipv4Address("255.255.255.255"));
        ipv4h.SetPayloadSize(p->GetSize());
        ipv4h.SetTtl(1);
        ipv4h.SetIdentification(UniformVariable(0, 65536).GetValue());
        p->AddHeader(ipv4h);
        AddHeader(p, 0x800);
        CustomHeader ch(CustomHeader::L2_Header | CustomHeader::L3_Header | CustomHeader::L4_Header);
        p->PeekHeader(ch);
        SwitchSend(0, p, ch);
    }

    bool QbbNetDevice::SwitchSend(uint32_t qIndex, Ptr<Packet> packet, CustomHeader &ch){
        m_macTxTrace(packet);
        m_traceEnqueue(packet, qIndex);
        m_queue->Enqueue(packet, qIndex);
        DequeueAndTransmit();
        return true;
    }

SwitchSend(0, p, ch) enqueues the PFC frame into the queue for qIndex 0.

QbbNetDevice::DequeueAndTransmit

The handling logic for switch nodes:

    p = m_queue->DequeueRR(m_paused); //this is round-robin
    if (p != 0){
        m_snifferTrace(p);
        m_promiscSnifferTrace(p);
        Ipv4Header h;
        Ptr<Packet> packet = p->Copy();
        uint16_t protocol = 0;
        ProcessHeader(packet, protocol);
        packet->RemoveHeader(h);
        FlowIdTag t;
        uint32_t qIndex = m_queue->GetLastQueue();
        if (qIndex == 0){ //this is a pause or cnp, send it immediately!
            m_node->SwitchNotifyDequeue(m_ifIndex, qIndex, p);
            p->RemovePacketTag(t);
        }else{
            m_node->SwitchNotifyDequeue(m_ifIndex, qIndex, p);
            p->RemovePacketTag(t);
        }
        m_traceDequeue(p, qIndex);
        TransmitStart(p);
        return;
    }

Before QbbNetDevice sends a packet into the channel, it calls SwitchNode::SwitchNotifyDequeue. The main jobs of SwitchNotifyDequeue are:

- Update the state of the MMU (memory management unit).
- Optionally apply ECN (explicit congestion notification) marking.
- Check for PFC resume. It calls CheckAndSendResume(inDev, qIndex). Internally, if m_mmu->CheckShouldResume(inDev, qIndex) returns true (headroom has drained and shared usage is below the threshold), it sends a PFC resume frame to the ingress port with device->SendPfc(qIndex, 1) and calls m_mmu->SetResume(inDev, qIndex) to clear the paused state.
- Update congestion control and INT/HPCC telemetry information.

QbbNetDevice::TransmitStart
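TransmitStart schedules the TransmitComplete event after the packet's transmission time plus the interframe gap. As a rough illustration of that arithmetic, here is a minimal sketch. TxCompleteNs is a hypothetical helper, not part of ns-3 or HPCC; it assumes the link rate is given in bits per second and works in integer nanoseconds instead of the ns-3 Time type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not ns-3 code): nanoseconds from the start of a
// transmission until the completion event would fire.
uint64_t TxCompleteNs(uint32_t packet_bytes, uint64_t link_bps, uint64_t interframe_gap_ns) {
    assert(link_bps > 0);
    // Serialization delay: bits on the wire divided by the link rate.
    // The intermediate product fits in 64 bits for realistic packet sizes.
    uint64_t tx_time_ns = static_cast<uint64_t>(packet_bytes) * 8ULL * 1000000000ULL / link_bps;
    return tx_time_ns + interframe_gap_ns;
}
```

For a 1000-byte packet on a 100 Gbps link with no interframe gap, this gives 80 ns. A PFC frame queued at qIndex 0 has to wait for whatever packet is currently on the wire to finish, which is one reason headroom has to cover more than the propagation delay alone.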
To send the packet upstream, it is handed to the channel.

    bool QbbNetDevice::TransmitStart(Ptr<Packet> p)
    {
        NS_LOG_FUNCTION(this << p);
        NS_LOG_LOGIC("UID is " << p->GetUid() << ")");
        //
        // This function is called to start the process of transmitting a packet.
        // We need to tell the channel that we've started wiggling the wire and
        // schedule an event that will be executed when the transmission is complete.
        //
        NS_ASSERT_MSG(m_txMachineState == READY, "Must be READY to transmit");
        m_txMachineState = BUSY;
        m_currentPkt = p;
        m_phyTxBeginTrace(m_currentPkt);
        Time txTime = Seconds(m_bps.CalculateTxTime(p->GetSize()));
        Time txCompleteTime = txTime + m_tInterframeGap;
        NS_LOG_LOGIC("Schedule TransmitCompleteEvent in " << txCompleteTime.GetSeconds() << "sec");
        Simulator::Schedule(txCompleteTime, &QbbNetDevice::TransmitComplete, this);

        bool result = m_channel->TransmitStart(p, this, txTime);
        if (result == false)
        {
            m_phyTxDropTrace(p);
        }
        return result;
    }

Suppose the buffer occupancy charged to QbbNetDeviceB on the switch exceeds the threshold and triggers PFC. The PFC frame is sent through the channel to QbbNetDeviceA.

[Figure: the packet simulation model in ns-3]

When QbbNetDeviceA receives the PFC frame, it handles it as follows.

QbbNetDevice::Receive

    void QbbNetDevice::Receive(Ptr<Packet> packet)
    {
        NS_LOG_FUNCTION(this << packet);
        if (!m_linkUp){
            m_traceDrop(packet, 0);
            return;
        }

        if (m_receiveErrorModel && m_receiveErrorModel->IsCorrupt(packet))
        {
            //
            // If we have an error model and it indicates that it is time to lose a
            // corrupted packet, don't forward this packet up, let it go.
            //
            m_phyRxDropTrace(packet);
            return;
        }

        m_macRxTrace(packet);
        CustomHeader ch(CustomHeader::L2_Header | CustomHeader::L3_Header | CustomHeader::L4_Header);
        ch.getInt = 1; // parse INT header
        packet->PeekHeader(ch);
        if (ch.l3Prot == 0xFE){ // PFC
            if (!m_qbbEnabled) return;
            unsigned qIndex = ch.pfc.qIndex;
            if (ch.pfc.time > 0){
                m_tracePfc(1);
                m_paused[qIndex] = true;
            }else{
                m_tracePfc(0);
                Resume(qIndex);
            }
        }else{ // non-PFC packets (data, ACK, NACK, CNP...)
            if (m_node->GetNodeType() > 0){ // switch
                packet->AddPacketTag(FlowIdTag(m_ifIndex));
                m_node->SwitchReceiveFromDevice(this, packet, ch);
            }else{ // NIC
                // send to RdmaHw
                int ret = m_rdmaReceiveCb(packet, ch);
                // TODO we may based on the ret do something
            }
        }
        return;
    }

Setting m_paused[qIndex] to true pauses transmission on that queue. qIndex is the priority used by the data flow. QbbNetDeviceA resumes sending only after it receives the resume frame sent by QbbNetDeviceB.

Reference

[1] Notes on Flow Control for High Speed Networks
[2] Switch Buffering Architecture and Packet Scheduling Algorithms
[3] ns3-datacenter switch-mmu.cc