Building an AI Academic Paper Assistant from Scratch (Part 2): PDF Parsing and the Core of RAG Retrieval
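This is the second article in the series. It covers the core of the system: how to split a PDF into meaningful chunks, how to build the vector index, and how to keep retrieval quality up when several papers are selected at once.

1. PDF parsing: what goes wrong

Calling pdfplumber or pdfminer directly to extract text from a PDF usually gives you something like this:

```
Abstract This paper presents a novel approach...
1. Introduction The field of natural language...
Table 1. Results on GLUE benchmark...
[1] Vaswani, A., et al. Attention is All You Need...
[2] Devlin, J., et al. BERT: Pre-training of Deep...
```

The problems pile up:

- Section headings and body text run together, so there is no way to tell where a section starts.
- The reference list is indexed along with the body. It is dense with keywords, so it gets hit constantly at retrieval time.
- Chinese and English PDFs need different extraction strategies.
- The chunk size is wrong: too small and a chunk lacks context, too large and a single chunk uses up the top_k budget.

2. English papers: font-based section detection

English academic papers usually follow the IMRaD layout (Abstract / Introduction / Method / Results / Discussion), and section headings are set in a larger size than the body text or in bold. PyMuPDF's dict mode exposes the font information of every text span:

```python
from collections import Counter

import fitz  # PyMuPDF


def extract_sections_en(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    blocks_all = []
    for page_num, page in enumerate(doc):
        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block.get("type") != 0:  # type 0 = text block
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span["text"].strip()
                    if not text:
                        continue  # skip whitespace-only spans
                    blocks_all.append({
                        "text": text,
                        "size": span["size"],  # font size
                        "bold": "Bold" in span.get("font", ""),
                        "page": page_num + 1,
                    })

    # Body font size = the most common size among longer spans
    sizes = [b["size"] for b in blocks_all if len(b["text"]) > 20]
    body_size = Counter(sizes).most_common(1)[0][0] if sizes else 10.0

    # A span is a heading if its font is more than 15% larger than the body
    # text, or if it is bold and matches a section keyword.
    SECTION_KEYWORDS = {
        "abstract", "introduction", "related work", "background",
        "methodology", "method", "approach", "experiment", "result",
        "discussion", "conclusion", "reference", "acknowledgement",
    }

    sections = []
    current_section = "Header"
    current_text = []

    for b in blocks_all:
        text_lower = b["text"].lower().strip("0123456789. \n\t")
        # Keywords are singular, so also try the heading without a trailing
        # "s" ("Results" -> "result", "References" -> "reference").
        is_keyword = (
            text_lower in SECTION_KEYWORDS
            or text_lower.rstrip("s") in SECTION_KEYWORDS
        )
        is_heading = (
            b["size"] > body_size * 1.15
            or (b["bold"] and is_keyword)
        )
        if is_heading and len(b["text"]) < 80:
            if current_text:
                sections.append({
                    "section": current_section,
                    "text": " ".join(current_text),
                })
            current_section = b["text"].strip()
            current_text = []
        else:
            current_text.append(b["text"])

    if current_text:
        sections.append({"section": current_section, "text": " ".join(current_text)})
    return sections
```

Bilingual section names

So that a question asked in Chinese can still hit an English section, section names are stored in an "English中文" form:

```python
SECTION_CN_MAP = {
    "abstract": "摘要", "introduction": "引言",
    "related work": "相关工作", "methodology": "研究方法",
    "method": "方法", "experiment": "实验",
    "result": "结果", "discussion": "讨论",
    "conclusion": "结论", "reference": "参考文献",
}


def bilingual_section(name: str) -> str:
    key = name.lower().strip("0123456789. ")
    # Fall back to the singular form so that "Results" matches "result".
    cn = SECTION_CN_MAP.get(key) or SECTION_CN_MAP.get(key.rstrip("s"))
    if cn:
        return f"{name.title()}{cn}"
    return name
```

The effect: Abstract → Abstract摘要, so a user who asks 「摘要说了什么」 ("What does the abstract say?") hits the right section.

3. Filtering out the reference list

The reference section is a noise source for RAG. It is packed with paper titles, author names and keywords, so it is retrieved often, yet it contributes nothing of substance to the user's question.

```python
NOISE_SECTIONS = {
    "references", "bibliography", "acknowledgements", "acknowledgments",
    "appendix", "参考文献", "致谢", "附录",
}


def is_noise_section(section_name: str) -> bool:
    name = section_name.lower().strip("0123456789. \n\t")
    # Substring match, so bilingual names such as "References参考文献" are caught too.
    return any(kw in name for kw in NOISE_SECTIONS)
```

Chunks from these sections are skipped at indexing time, which removes this source of false hits from retrieval altogether.

4. Two-level chunking: child chunks + parent chunks

A single chunk size pulls in two directions:

- Retrieval wants small chunks, because vector similarity matches them more precisely.
- Answering wants large chunks, because the more complete the context, the better the LLM's answer.

The solution: embed and retrieve child chunks, then expand each hit to its parent chunk (N consecutive child chunks) before passing it to the LLM.

```
child 1  child 2  child 3  child 4  child 5  child 6  child 7  child 8
└──────────── parent A ───────────┘ └──────────── parent B ───────────┘
```

```python
# config.py
CHILD_CHUNK_ZH = 280     # child chunk size for Chinese, in characters
CHILD_CHUNK_EN = 480     # child chunk size for English, in characters
CHUNK_OVERLAP_ZH = 55    # sliding-window overlap, ~20%
CHUNK_OVERLAP_EN = 90    # sliding-window overlap, ~19%
PARENT_WINDOW = 3        # parent chunk = 3 consecutive child chunks
```

```python
def make_chunks(text: str, doc_id: str, section: str, is_zh: bool,
                start_idx: int = 0) -> list[dict]:
    size = CHILD_CHUNK_ZH if is_zh else CHILD_CHUNK_EN
    overlap = CHUNK_OVERLAP_ZH if is_zh else CHUNK_OVERLAP_EN
    chunks = []
    start = 0
    # start_idx is an addition here: chunk_id has to stay unique within a
    # document (it is part of the Chroma id), so pass the running count
    # when a document is chunked section by section.
    idx = start_idx
    while start < len(text):
        end = start + size
        chunks.append({
            "doc_id": doc_id,
            "section": section,
            "chunk_id": idx,
            "text": text[start:end],
        })
        if end >= len(text):
            break  # the tail is covered; skip a final chunk made only of overlap
        start += size - overlap
        idx += 1
    return chunks
```

At query time, each child hit is expanded to its parent chunk:

```python
def expand_hits_to_parent(hits: list[dict], window: int = 3) -> list[dict]:
    """Expand each child hit into a parent of `window` consecutive child chunks."""
    from services.embedder import get_chunks_range

    expanded = []
    seen = set()
    for hit in hits:
        doc_id = hit["doc_id"]
        chunk_id = hit["chunk_id"]
        parent_start = max(0, chunk_id - window // 2)
        parent_end = parent_start + window
        key = (doc_id, parent_start)
        if key in seen:
            continue  # this parent is already in the context
        seen.add(key)
        parent_chunks = get_chunks_range(doc_id, parent_start, parent_end)
        # Adjacent child chunks overlap, so the overlap text appears twice here.
        parent_text = "\n".join(c["text"] for c in parent_chunks)
        expanded.append({
            **hit,
            "text": parent_text,
            "is_parent": True,
        })
    return expanded
```

5. Embedding and indexing with Jina AI

```python
import os

import requests

JINA_API_KEY = os.getenv("JINA_API_KEY", "")
JINA_EMBED_URL = "https://api.jina.ai/v1/embeddings"


def embed_texts(texts: list[str]) -> list[list[float]]:
    headers = {
        "Authorization": f"Bearer {JINA_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "jina-embeddings-v3",
        "input": texts,
        "task": "retrieval.passage",  # use the passage task when indexing
    }
    resp = requests.post(JINA_EMBED_URL, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return [item["embedding"] for item in data["data"]]
```

Why Chinese and English line up: Jina v3 was trained with contrastive learning on Chinese-English translation pairs, so Chinese and English texts with the same meaning sit close together in the vector space, and cross-language retrieval works without translation.

Indexing is done in batches so that no single request times out:

```python
def add_chunks(chunks: list[dict], batch_size: int = 64) -> int:
    added = 0
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        embeddings = embed_texts(texts)
        # Store the whole batch in Chroma with one call
        # (`collection` is the Chroma collection created elsewhere in the project).
        collection.add(
            ids=[f"{c['doc_id']}_{c['chunk_id']}" for c in batch],
            embeddings=embeddings,
            metadatas=[{
                "doc_id": c["doc_id"],
                "section": c["section"],
                "chunk_id": c["chunk_id"],
            } for c in batch],
            documents=texts,
        )
        added += len(batch)
    return added
```

6. RAG retrieval: query embedding + similarity search

```python
def search(query: str, doc_ids: list[str] | None = None, top_k: int = 8) -> list[dict]:
    # Queries use the retrieval.query task, as opposed to passage at indexing time
    headers = {"Authorization": f"Bearer {JINA_API_KEY}"}
    payload = {
        "model": "jina-embeddings-v3",
        "input": [query],
        "task": "retrieval.query",
    }
    resp = requests.post(JINA_EMBED_URL, headers=headers, json=payload, timeout=20)
    resp.raise_for_status()
    query_emb = resp.json()["data"][0]["embedding"]

    # Chroma similarity search, optionally restricted to the selected documents
    where = {"doc_id": {"$in": doc_ids}} if doc_ids else None
    results = collection.query(
        query_embeddings=[query_emb],
        n_results=top_k,
        where=where,
        include=["documents", "metadatas", "distances"],
    )

    hits = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        hits.append({
            "text": doc,
            "doc_id": meta["doc_id"],
            "section": meta["section"],
            "chunk_id": meta["chunk_id"],
            # Cosine distance -> similarity. This only holds if the collection
            # was created with cosine distance; Chroma defaults to squared L2.
            "score": 1 - dist,
        })
    return hits
```

7. Multiple papers: keeping one paper from monopolising the results

When a user selects 5 papers and asks a question, a naive top_k=8 search can let the most relevant paper take all 8 slots, so the other papers never appear in the context.

The approach: when the query shows a multi-paper comparison intent, give each paper its own retrieval quota.

```python
MULTI_KEYWORDS = ["分析", "对比", "比较", "所有", "这些", "每篇", "综合",
                  "compare", "all", "each", "these papers"]


def is_multi_intent(query: str) -> bool:
    # Plain substring match: English keywords also fire inside longer words
    # (for example "all" in "small").
    return any(kw in query.lower() for kw in MULTI_KEYWORDS)


def search_multi_docs(query: str, doc_ids: list[str], per_doc: int = 3) -> list[dict]:
    """Retrieve per_doc chunks from each paper separately."""
    all_hits = []
    for doc_id in doc_ids:
        hits = search(query, doc_ids=[doc_id], top_k=per_doc)
        # If the abstract was not retrieved, add one abstract chunk
        has_abstract = any(
            "摘要" in h["section"] or "abstract" in h["section"].lower()
            for h in hits
        )
        if not has_abstract:
            # get_section_chunks is a project helper that is not shown in this article
            abstract_hits = get_section_chunks(doc_id, "abstract", n=1)
            if abstract_hits:  # only give up a slot when an abstract exists
                hits = abstract_hits + hits[:per_doc - 1]
        all_hits.extend(hits)
    return all_hits
```

The decision is made in the routing layer:

```python
if len(selected_doc_ids) > 1 and is_multi_intent(question):
    hits = search_multi_docs(question, selected_doc_ids, per_doc=3)
else:
    hits = search(question, doc_ids=selected_doc_ids, top_k=TOP_K)

hits = expand_hits_to_parent(hits, window=PARENT_WINDOW)
```

8. Building the prompt

```python
def build_prompt(question: str, hits: list[dict], doc_names: dict) -> str:
    context_parts = []
    for hit in hits:
        doc_name = doc_names.get(hit["doc_id"], hit["doc_id"])
        context_parts.append(
            f"【{doc_name} · {hit['section']}】\n{hit['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # The prompt stays in Chinese because the assistant answers in Chinese.
    # It says: you are a rigorous academic research assistant; answer from the
    # given paper content only, do not make things up, cite the source paper
    # and section for specific data, and answer concisely in an academic style.
    return f"""你是一位严谨的学术研究助手。请根据以下文献内容回答用户问题。

回答要求:
- 基于提供的文献内容,不要凭空捏造
- 如引用具体数据,标明来源文献和章节
- 用中文回答,学术风格,简洁准确

文献内容:
{context}

用户问题:{question}"""
```

9. Comparison

| Approach | Outcome |
| --- | --- |
| Naive top_k=8 search over the whole index | One paper monopolises the results in multi-paper queries; the reference list pollutes the hits |
| Filtering by section | Only works when the user names a section |
| This approach (child-chunk retrieval + parent-chunk context + per-paper quotas) | ✅ Balanced coverage across papers, complete context, references excluded from retrieval |

10. Summary

The key points of this article:

1. PyMuPDF font analysis: identifies the sections of English papers more reliably than keyword matching alone.
2. Bilingual section names: the "Abstract摘要" form lets both Chinese and English queries hit.
3. Reference filtering: noise sections are skipped at indexing time instead of being filtered at query time.
4. Child + parent chunks: vector retrieval stays precise while the context passed to the LLM stays complete.
5. Balanced multi-paper quotas: detect comparison intent, then retrieve from each paper separately.

Next article: multi-model routing design. What do you do when Groq rate-limits you? An automatic fallback scheme.

Original article by the [六墨书场] team.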
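As a footnote to section 4, the child/parent arithmetic can be checked on a toy string without any vector store. The sketch below is only an illustration under stated assumptions: `split_children` and `parent_span` are hypothetical names, and the `start`/`end` offsets stored on each child are an addition that the article's chunks do not carry. With offsets available, a parent can be cut straight out of the source text, so the overlap between neighbouring children is not repeated.

```python
def split_children(text: str, size: int, overlap: int) -> list[dict]:
    """Slide a window of `size` characters with stride `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap
    children = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        # Hypothetical fields: character offsets of the child in the source text.
        children.append({"chunk_id": len(children), "start": start, "end": end})
        if end == len(text):
            break  # the tail is covered; avoid a final chunk made only of overlap
        start += stride
    return children


def parent_span(children: list[dict], hit_id: int, window: int) -> tuple[int, int]:
    """Character span of up to `window` consecutive children around the hit."""
    first = max(0, hit_id - window // 2)
    last = min(len(children), first + window) - 1
    return children[first]["start"], children[last]["end"]


# Toy document: 1000 characters, chunked with the Chinese settings from config.py.
text = "".join(chr(ord("a") + i % 26) for i in range(1000))
children = split_children(text, size=280, overlap=55)

# Pretend retrieval hit child 2, then expand it to a 3-child parent.
start, end = parent_span(children, hit_id=2, window=3)
parent_text = text[start:end]

print([c["start"] for c in children])  # [0, 225, 450, 675, 900]
print(start, end, len(parent_text))    # 225 955 730
```

The parent is 730 characters long, that is 3 × 280 minus the two 55-character overlaps. Joining the three child strings directly would give 840 characters with each overlap duplicated.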