Evaluating Embedding Models and Pipeline Optimization for AI Search Quality
Zhong, Philip, Chen, Kent, Wang, Don
–arXiv.org Artificial Intelligence
We evaluate the performance of various text embedding models and pipeline configurations for AI-driven search systems. We compare sentence-transformer and generative embedding models (e.g., All-MPNet, BGE, GTE, and Qwen) at different dimensions, indexing methods (Milvus HNSW/IVF), and chunking strategies. A custom evaluation dataset of 11,975 query-chunk pairs was synthesized from US City Council meeting transcripts using a local large language model (LLM). The data pipeline includes preprocessing, automated question generation per chunk, manual validation, and continuous integration/continuous deployment (CI/CD) integration. We measure retrieval accuracy using reference-based metrics: Top-K Accuracy and Normalized Discounted Cumulative Gain (NDCG). Our results demonstrate that higher-dimensional embeddings significantly boost search quality (e.g., Qwen3-Embedding-8B/4096 achieves Top-3 accuracy about 0.571 versus 0.412 for GTE-large/1024), and that neural re-rankers (e.g., a BGE cross-encoder) further improve ranking accuracy (Top-3 up to 0.527). Finer-grained chunking (512 characters versus 2000 characters) also improves accuracy. We discuss the impact of these factors and outline future directions for pipeline automation and evaluation.
arXiv.org Artificial Intelligence
Dec-1-2025
- Country:
- North America > United States > California > Santa Clara County > San Jose (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Government (0.34)
- Technology: