Building RAG with Chinese LLMs: From Document Processing to Vector Search
Build a complete RAG (Retrieval-Augmented Generation) system using a combination of models available through ChinaWHAPI, covering document chunking, embedding, vector search, and generation.
RAG Workflow
RAG stands for Retrieval-Augmented Generation. The core idea is to retrieve relevant content from a knowledge base first, then pass the retrieved results together with the original question to the LLM, which generates the answer.
Document Processing
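Split long documents into appropriately sized chunks (typically 500-1000 characters) and embed each chunk independently. Chunking improves retrieval accuracy, because each vector represents one focused passage, and helps control costs, because only the relevant chunks are sent to the LLM.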
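The chunking step can be sketched as a fixed-size character splitter. This is a minimal illustration, not a prescribed implementation: the function name chunk_text, the default size of 800 characters, and the 100-character overlap between neighboring chunks are all assumptions (the overlap is a common addition that keeps sentences cut at a boundary retrievable from either side).

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Both defaults are
    illustrative assumptions, chosen inside the 500-1000 character range.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

In practice it may be preferable to split on paragraph or sentence boundaries rather than raw character offsets; the character-based version is shown only because it is the simplest form of the idea.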
Vector Search
Use an embedding model to convert each chunk into a vector, and store the vectors in a vector database (e.g., Milvus, Pinecone, Qdrant). At retrieval time, convert the user's question into a vector with the same embedding model and find the most similar chunks using cosine similarity.
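To make the similarity step concrete, here is a small brute-force sketch of cosine-similarity retrieval in plain Python. It stands in for what a vector database does at scale; the function names cosine_similarity and top_k are hypothetical, and the vectors in the usage note are toy values rather than real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return cos(a, b) = (a . b) / (|a| * |b|), or 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int = 3) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(chunk_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2) ranks chunk 0 first and chunk 2 second. A production system would delegate this scan to the vector database's approximate nearest-neighbor index instead of comparing against every chunk.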
Generation Phase
{
  "model": "qwen3.6-plus",
  "messages": [
    {
      "role": "system",
      "content": "You are a technical support assistant. Answer user questions based on the following reference materials. If the materials don't contain relevant information, say so honestly, do not fabricate."
    },
    {
      "role": "user",
      "content": "Reference materials: {retrieved_chunks}\nQuestion: {question}"
    }
  ]
}
ChinaWHAPI's Role in RAG
ChinaWHAPI provides all required models: embedding models for vectorization, DeepSeek V4 Flash for initial recall filtering, Qwen3.6 Plus for final generation, and Kimi K2.6 for ultra-long documents.
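As a sketch of how the pieces connect, the request body shown in the Generation Phase section can be assembled from the retrieved chunks like this. The function name build_generation_request and the numbered "[1] ..." chunk formatting are assumptions made for illustration; the model name and prompt text are taken from the request above. Sending the payload is left out because the endpoint and authentication details are not given here.

```python
import json

# System prompt copied from the request body in the Generation Phase section.
SYSTEM_PROMPT = (
    "You are a technical support assistant. Answer user questions based on "
    "the following reference materials. If the materials don't contain "
    "relevant information, say so honestly, do not fabricate."
)


def build_generation_request(question: str,
                             retrieved_chunks: list[str],
                             model: str = "qwen3.6-plus") -> dict:
    """Build the chat request body from a question and its retrieved chunks."""
    # Numbering the chunks is an illustrative choice, not a requirement.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Reference materials: {context}\nQuestion: {question}",
            },
        ],
    }


if __name__ == "__main__":
    payload = build_generation_request(
        "How do I reset my password?",
        ["Passwords can be reset from the account settings page."],
    )
    # json.dumps escapes the embedded newlines, so the output is valid JSON.
    print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Building the body as a dictionary and serializing it with json.dumps avoids hand-escaping quotes and newlines inside the retrieved text.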