After scaling my dataset, my RAG pipeline retrieves irrelevant chunks. Is embedding drift the cause?

0 votes
May 21 in Generative AI by gaurav
• 24,860 points
36 views

1 answer to this question.

0 votes

Embedding drift may indeed be a contributing factor, but in practical RAG systems, irrelevant retrieval after scaling is usually caused by a broader degradation in retrieval quality.

Retrieval issues that go unnoticed at 1k chunks become apparent at 1M chunks.

The Most Frequent Reasons for RAG Retrieval Degradation at Scale

Chunking Strategy Breaks at Scale (Most Common)

Chunking techniques that are effective for small datasets frequently fail on large corpora.

For instance:

Fixed 1,000-token blocks that ignore semantic boundaries, too little overlap between chunks, and tables and code split in the wrong places.

Outcome:

Embedding vectors become noisy and lose their discriminative power.

Two unrelated chunks can end up "close" in vector space.

Symptoms:

  • Semantically broad chunks dominate the results

  • Generic chunks are retrieved frequently

  • Headers are retrieved rather than answers
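One way to check whether generic chunks are dominating retrieval is to run a sample of queries and measure how often each chunk lands in the top k. The sketch below is only an illustration: the toy 3-dimensional vectors, the chunk names, and the helper names (`top_k`, `retrieval_frequency`) are assumptions, and a real pipeline would pull embeddings and results from its own vector store instead of brute-force cosine similarity.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query."""
    scored = sorted(chunk_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

def retrieval_frequency(query_vecs, chunk_vecs, k=3):
    """Fraction of queries for which each chunk appears in the top k."""
    counts = Counter()
    for query_vec in query_vecs:
        counts.update(top_k(query_vec, chunk_vecs, k))
    return {cid: counts[cid] / len(query_vecs) for cid in chunk_vecs}

# Toy data (hypothetical): one broad chunk that is moderately similar
# to everything, plus three topic-specific chunks.
demo_chunks = {
    "generic_header": [1.0, 1.0, 1.0],
    "billing": [1.0, 0.0, 0.0],
    "auth": [0.0, 1.0, 0.0],
    "export": [0.0, 0.0, 1.0],
}
demo_queries = [[1.0, 0.2, 0.0], [0.0, 1.0, 0.2], [0.2, 0.0, 1.0]]

for chunk_id, share in retrieval_frequency(demo_queries, demo_chunks, k=2).items():
    print(f"{chunk_id}: retrieved for {share:.0%} of queries")
```

A chunk that shows up for most queries regardless of topic is a candidate for re-chunking or filtering. What counts as "too often" depends on the corpus, so any threshold is a judgment call.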

Fix

  • Use semantic chunking that respects paragraph, section, Markdown, code, and table boundaries

  • Tune chunk sizes: 200–500 tokens for dense retrieval

  • Prefer adaptive chunking over fixed-size chunking
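As a rough illustration of boundary-aware chunking, the sketch below packs whole paragraphs into chunks under a token budget, starts a new chunk at each Markdown heading, and never splits a fenced code block. It is a minimal sketch under stated assumptions, not a full semantic chunker: `count_tokens` approximates tokens with whitespace-separated words (swap in your embedding model's tokenizer), the function names are hypothetical, and tables are only kept intact when they contain no blank lines.

```python
FENCE = "`" * 3  # Markdown code-fence marker

def count_tokens(text):
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())

def split_blocks(text):
    """Split text on blank lines, keeping fenced code blocks in one piece."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
            current.append(line)
        elif not line.strip() and not in_fence:
            # A blank line outside a fence closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

def chunk_document(text, max_tokens=400):
    """Pack whole blocks into chunks of at most max_tokens.

    A block larger than max_tokens becomes its own oversized chunk
    rather than being cut in the middle.
    """
    chunks, current, size = [], [], 0
    for block in split_blocks(text):
        n = count_tokens(block)
        # Heuristic: a block starting with "#" is treated as a heading.
        starts_section = block.lstrip().startswith("#")
        if current and (size + n > max_tokens or starts_section):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Calling `chunk_document(text, max_tokens=400)` returns a list of strings ready for embedding. Because each chunk begins at a heading or paragraph boundary, a heading stays attached to the text that follows it instead of being embedded on its own.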

answered 6 days ago by anonymous
• 1,420 points
