Refa: GraphRAG and explaining GraphRAG stalling behavior on large files (#8223)

### What problem does this PR solve?

This PR investigates the cause of #7957.

TL;DR: Incorrect similarity calculations lead to too many candidates. Because candidate selection involves interaction with the LLM, this causes significant delays in the program.

What this PR does:

1. **Fix similarity calculation**: When processing a 64-page government document, the corrected similarity calculation reduces the number of candidates from over 100,000 to around 16,000. With a default batch size of 100 pairs per LLM call, this fix reduces unnecessary LLM interactions from over 1,000 calls to around 160, a roughly 10x improvement.
2. **Add concurrency and timeout limits**: Up to 5 entity types are processed in "parallel", each with a 180-second timeout. These limits may be configurable in future updates.
3. **Improve logging**: The candidate resolution process now reports progress in real time.
4. **Mitigate potential concurrency risks**

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
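
The call counts quoted above follow directly from the batch size. As a quick check, the sketch below recomputes them; `llm_calls` is a hypothetical helper written for this note, not a function from the repository, and the candidate counts are the rounded figures from the description.

```python
def llm_calls(num_candidate_pairs: int, batch_size: int = 100) -> int:
    """Number of LLM calls needed when pairs are sent in fixed-size batches.

    Hypothetical helper for illustration; uses ceiling division so that a
    final, partially filled batch still counts as one call.
    """
    return (num_candidate_pairs + batch_size - 1) // batch_size


before = llm_calls(100_000)  # rounded "over 100,000" candidates
after = llm_calls(16_000)    # rounded "around 16,000" candidates
ratio = before / after
```

With these rounded inputs, `before` is 1000, `after` is 160, and `ratio` is 6.25. Since the original counts are stated as "over" 100,000 candidates and "over" 1,000 calls, the actual reduction may be larger than this lower-bound estimate.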
2026-01-04 03:25:30 +08:00 · 2025-06-12 19:09:50 +08:00
parent d36c8d18b1
commit 24ca4cc6b7
3 changed files with 90 additions and 16 deletions
--- a/graphrag/general/community_reports_extractor.py
+++ b/graphrag/general/community_reports_extractor.py
@@ -89,7 +89,15 @@ class CommunityReportsExtractor(Extractor):
            text = perform_variable_replacements(self._extraction_prompt, variables=prompt_variables)
            gen_conf = {"temperature": 0.3}
            async with chat_limiter:
-                response = await trio.to_thread.run_sync(lambda: self._chat(text, [{"role": "user", "content": "Output:"}], gen_conf))
+                try:
+                    with trio.move_on_after(120) as cancel_scope:
+                        response = await trio.to_thread.run_sync( self._chat, text, [{"role": "user", "content": "Output:"}], gen_conf)
+                    if cancel_scope.cancelled_caught:
+                        logging.warning("extract_community_report._chat timeout, skipping...")
+                        return
+                except Exception as e:
+                    logging.error(f"extract_community_report._chat failed: {e}")
+                    return
            token_count += num_tokens_from_string(text + response)
            response = re.sub(r"^[^\{]*", "", response)
            response = re.sub(r"[^\}]*$", "", response)
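
The hunk above bounds the blocking `_chat` call with a timeout and skips the community report when the call expires or fails. The same skip-on-timeout idea, combined with the concurrency cap mentioned in the PR description, can be sketched as follows. This is a minimal illustration that uses `asyncio` from the standard library rather than `trio` (which the PR uses); `resolve_all` and `resolve_one` are hypothetical names, not functions from the repository.

```python
import asyncio
import logging

MAX_CONCURRENT = 5     # concurrency limit quoted in the PR description
TIMEOUT_SECONDS = 180  # per-entity-type timeout quoted in the PR description


async def resolve_all(entity_types, resolve_one,
                      max_concurrent=MAX_CONCURRENT, timeout=TIMEOUT_SECONDS):
    """Run resolve_one(entity_type) for every type, at most max_concurrent at once.

    A type whose call exceeds `timeout` seconds, or raises, is logged and
    mapped to None instead of aborting the whole run.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    results = {}

    async def worker(entity_type):
        async with semaphore:  # caps the number of in-flight calls
            try:
                results[entity_type] = await asyncio.wait_for(
                    resolve_one(entity_type), timeout
                )
            except asyncio.TimeoutError:
                logging.warning("resolve %s timed out, skipping...", entity_type)
                results[entity_type] = None
            except Exception as exc:
                logging.error("resolve %s failed: %s", entity_type, exc)
                results[entity_type] = None

    await asyncio.gather(*(worker(t) for t in entity_types))
    return results
```

One detail worth checking against the trio documentation when adapting this to the code in the diff: `trio.to_thread.run_sync` does not abandon its worker thread on cancellation unless `abandon_on_cancel=True` is passed (`cancellable=True` in older releases), so a surrounding `trio.move_on_after` may not cut a blocked call short on its own.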