An Nginx change made every gateway request return 400
Our project uses Spring Cloud Gateway as the gateway that receives the public-internet traffic of every business line. Two Nginx machines sit in front of the gateway as reverse proxies. The gateway performs a series of pre-processing steps and then forwards each request to the services of the business lines behind it. The simplified network path is:

    gateway domain (wmg.test.com) -> ... -> Nginx -> F5 (hardware load balancer, domain fp.wmg.test) -> gateway -> business systems

One day the team that operates Nginx decided to add two new Nginx machines (the reason is a long story, so I will skip it) and use them to replace the two Nginx machines that had been reverse-proxying the gateway.

SRE severity: P1

One dark and windy night, the Nginx team carried out the production change. They deployed Nginx on the two new machines and asked the network team to switch the gateway domain's traffic over to them. The moment the switch was done, people from the business-line teams reported that every API request going through the gateway was returning 400. The Nginx team had the network team switch the gateway domain's traffic back to the original two Nginx machines, and requests through the gateway recovered. The outage lasted a little over two minutes and was classified as SRE severity P1.

The Nginx team said the configuration on the two new machines was the same as on the two old ones and that they could not see anything wrong, so they came to me and asked me to check the gateway for error logs. That seemed unlikely to me: if the new configuration really were the same as the old one, we would not be getting 400 on every request. I checked the gateway logs anyway, and the gateway had logged no errors in that time window.

I then looked at the logs of the new Nginx. OPTIONS requests returned 204 as expected, while all other GET and POST requests returned 400. OPTIONS is the preflight request, and it is handled and answered at the Nginx layer. Sample log lines from the new Nginx:

    10.x.x.x:63048 - 10.x.x.x:8099 [2025-07-17T10:36:26+08:00] 10.x.x.x:8099 OPTIONS /api/xxx HTTP/1.1 204 0 https://domain/ Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 - [req_time:0.000 s] [upstream_connect_time:- s] [upstream_header_time:- s] [upstream_resp_time:- s] [-]
    10.x.x.x:63048 - 10.x.x.x:8099 [2025-07-17T10:36:26+08:00] 10.x.x.x:8099 POST /api/xxx HTTP/1.1 400 0 https://domain/ Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 - [req_time:0.001 s] [upstream_connect_time:0.000 s] [upstream_header_time:0.001 s] [upstream_resp_time:0.001 s] [10.x.x.x:8082]

I went to the network team. Their traffic-backtracking device showed that the 400 was indeed returned by the gateway and that the requests never reached the business systems behind it. 400 means Bad Request, so I suspected a problem with the request body and asked the network team to pull the packet capture for that time window so I could analyze it. They would not hand over the capture and gave me only the business payload parameters. The business payload of requests that go through the gateway is encrypted, and when I ran the program locally it decrypted the payload without any problem. I reported this back to the Nginx team.

The Nginx team spent some more time trying to locate the problem, still had no lead, and came back to ask me to help investigate.

Stepping in

I asked for the address of the test environment so that I could first see whether the problem could be reproduced there. A member of the Nginx team told me that nothing had been set up or tested in the test environment; another member had made this change directly in production.

I asked for the new and the old Nginx configuration files, compared them, and found differences. The old Nginx configuration for reverse-proxying the gateway was:

    server {
        listen 8080;
        server_name wmg.test.com;
        add_header X-Frame-Options SAMEORIGIN;
        add_header X-Content-Type-Options nosniff;
        add_header Content-Security-Policy "frame-ancestors 'self'";
        location / {
            proxy_hide_header host;
            client_max_body_size 100m;
            add_header Access-Control-Allow-Origin $http_origin always;
            add_header Access-Control-Allow-Credentials true always;
            add_header Access-Control-Allow-Methods 'GET, POST, OPTIONS, DELETE, PUT';
            add_header Access-Control-Allow-Headers ...;
            if ($request_method = 'OPTIONS') {
                return 204;
            }
            proxy_pass http://fp.wmg.test:8090;
        }
    }

The new Nginx configuration was:

    upstream http_gateways {
        server fp.wmg.test:8090;
        keepalive 30;
    }
    server {
        listen 8080 backlog=512;
        server_name wmg.test.com;
        add_header X-Frame-Options SAMEORIGIN;
        add_header X-Content-Type-Options nosniff;
        add_header Content-Security-Policy "frame-ancestors 'self'";
        location / {
            proxy_hide_header host;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            client_max_body_size 100m;
            add_header Access-Control-Allow-Origin $http_origin always;
            add_header Access-Control-Allow-Credentials true always;
            add_header Access-Control-Allow-Methods 'GET, POST, OPTIONS, DELETE, PUT';
            add_header Access-Control-Allow-Headers ...;
            if ($request_method = 'OPTIONS') {
                return 204;
            }
            proxy_pass http://http_gateways;
        }
    }

The new configuration differs from the old one in two ways.

It configures the gateway's F5 load-balancer address through an upstream block:

    upstream http_gateways {
        server fp.wmg.test:8090;
        keepalive 30;
    }

It sets the HTTP protocol version to 1.1 and enables keep-alive connections to the upstream:

    proxy_http_version 1.1;
    proxy_set_header Connection "";

I asked the Nginx team to mirror the production setup on a test-environment Nginx using the new configuration. Nginx (10.100.8.11) listens on port 9104, the gateway (10.100.22.48) listens on port 8081, and Nginx forwards port 9104 to port 8081 on the gateway. The configuration:

    upstream http_gateways {
        server 10.100.22.48:8081;
        keepalive 30;
    }
    server {
        listen 9104 backlog=512;
        server_name localhost;
        add_header X-Frame-Options SAMEORIGIN;
        add_header X-Content-Type-Options nosniff;
        add_header Content-Security-Policy "frame-ancestors 'self'";
        location / {
            proxy_hide_header host;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            client_max_body_size 100m;
            add_header Access-Control-Allow-Origin $http_origin always;
            add_header Access-Control-Allow-Credentials true always;
            add_header Access-Control-Allow-Methods 'GET, POST, OPTIONS, DELETE, PUT';
            add_header Access-Control-Allow-Headers ...;
            if ($request_method = 'OPTIONS') {
                return 204;
            }
            proxy_pass http://http_gateways;
        }
    }

Reproducing the problem

Requesting a backend service endpoint through Nginx and the gateway reproduced the problem. The response was 400:

    curl -v -X GET http://10.100.8.11:9104/wechat-web/actuator/info

After removing the following two directives, the same request returned 200:

    proxy_http_version 1.1;
    proxy_set_header Connection "";

Blame from out of nowhere

I reported this finding to the Nginx team. After digging for a long while, they concluded that the gateway does not support keep-alive connections and that the gateway would have to be modified.

That did not sound right. Gateway releases have always been rolling releases: on the F5 we first take one machine out of traffic, stop and restart the gateway service on that machine, and then put it back into traffic on the F5. When we drain traffic on the F5 there are keep-alive connections present, and every time we have to wait about 5 minutes before one path's traffic is fully drained.

So I had to put down the work at hand and spend some time proving that the gateway does support keep-alive connections. On the Nginx machine, I used the command line to call a backend service endpoint through the gateway with keep-alive explicitly requested:

    wget -d --header="Connection: keepalive" http://10.100.22.48:8081/wechat-web/actuator/info http://10.100.22.48:8081/wechat-web/actuator/info http://10.100.22.48:8081/wechat-web/actuator/info

Pressing Enter produced the following log:

    Setting --header (header) to Connection: keepalive
    DEBUG output created by Wget 1.14 on linux-gnu.

    URI encoding = ‘UTF-8’
    Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
    Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
    --2025-07-17 13:45:08--  http://10.100.22.48:8081/wechat-web/actuator/info
    Connecting to 10.100.22.48:8081... connected.
    Created socket 3.
    Releasing 0x0000000000c95a90 (new refcount 0).
    Deleting unused 0x0000000000c95a90.

    ---request begin---
    GET /wechat-web/actuator/info HTTP/1.1
    User-Agent: Wget/1.14 (linux-gnu)
    Accept: */*
    Host: 10.100.22.48:8081
    Connection: keepalive

    ---request end---
    HTTP request sent, awaiting response...
    ---response begin---
    HTTP/1.1 200 OK
    transfer-encoding: chunked
    Content-Type: application/vnd.spring-boot.actuator.v3+json
    Date: Thu, 17 Jul 2025 05:25:34 GMT

    ---response end---
    200 OK
    Registered socket 3 for persistent reuse.
    Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
    Saving to: ‘info’

        [ <=> ] 83          --.-K/s   in 0s

    2025-07-17 13:45:08 (7.75 MB/s) - ‘info’ saved [83]

    URI encoding = ‘UTF-8’
    Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
    Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
    --2025-07-17 13:45:08--  http://10.100.22.48:8081/wechat-web/actuator/info
    Reusing existing connection to 10.100.22.48:8081.
    Reusing fd 3.

    ---request begin---
    GET /wechat-web/actuator/info HTTP/1.1
    User-Agent: Wget/1.14 (linux-gnu)
    Accept: */*
    Host: 10.100.22.48:8081
    Connection: keepalive

    ---request end---
    HTTP request sent, awaiting response...
    ---response begin---
    HTTP/1.1 200 OK
    transfer-encoding: chunked
    Content-Type: application/vnd.spring-boot.actuator.v3+json
    Date: Thu, 17 Jul 2025 05:25:34 GMT

    ---response end---
    200 OK
    Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
    Saving to: ‘info.1’

        [ <=> ] 83          --.-K/s   in 0s

    2025-07-17 13:45:08 (9.47 MB/s) - ‘info.1’ saved [83]

    URI encoding = ‘UTF-8’
    Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
    Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
    --2025-07-17 13:45:08--  http://10.100.22.48:8081/wechat-web/actuator/info
    Reusing existing connection to 10.100.22.48:8081.
    Reusing fd 3.

    ---request begin---
    GET /wechat-web/actuator/info HTTP/1.1
    User-Agent: Wget/1.14 (linux-gnu)
    Accept: */*
    Host: 10.100.22.48:8081
    Connection: keepalive

    ---request end---
    HTTP request sent, awaiting response...
    ---response begin---
    HTTP/1.1 200 OK
    transfer-encoding: chunked
    Content-Type: application/vnd.spring-boot.actuator.v3+json
    Date: Thu, 17 Jul 2025 05:25:34 GMT

    ---response end---
    200 OK
    Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
    Saving to: ‘info.2’

        [ <=> ] 83          --.-K/s   in 0s

    2025-07-17 13:45:08 (11.1 MB/s) - ‘info.2’ saved [83]

    FINISHED --2025-07-17 13:45:08--
    Total wall clock time: 0.1s
    Downloaded: 3 files, 249 in 0s (9.25 MB/s)
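
The wget log shows the first request registering socket 3 for persistent reuse and the next two requests reusing that connection. The same check can be scripted so it is easy to repeat. Below is a minimal sketch in Python using only the standard library. It is not part of the original investigation: the function name check_keepalive and its parameters are made up for illustration, and it assumes plain HTTP with no authentication in front of the target.

```python
import http.client
from typing import List, Tuple


def check_keepalive(host: str, port: int, path: str,
                    attempts: int = 3) -> List[Tuple[int, bool]]:
    """Send `attempts` GET requests through one HTTPConnection object.

    Returns one (status, socket_reused) pair per request. socket_reused is
    True when the request was sent on the same socket as the previous one,
    which only happens if the server kept the connection open.
    (Hypothetical helper, named here for illustration only.)
    """
    conn = http.client.HTTPConnection(host, port, timeout=5)
    results = []
    previous_sock = None
    try:
        for _ in range(attempts):
            # request() opens a new socket automatically if the previous
            # one was closed after the last response.
            conn.request("GET", path, headers={"Connection": "keep-alive"})
            sock_in_use = conn.sock  # socket that carried this request
            response = conn.getresponse()
            response.read()  # drain the body so the connection can be reused
            results.append((response.status, sock_in_use is previous_sock))
            previous_sock = sock_in_use
    finally:
        conn.close()
    return results
```

Calling check_keepalive("10.100.22.48", 8081, "/wechat-web/actuator/info") from the Nginx machine would be the scripted counterpart of the wget command. The first pair always reports False because there is no earlier socket to compare against. If the later pairs report True, the server kept the connection open between requests.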