ragflow/toc_extraction_continue.md at e59458c36bd8b1492ced5d1509b4843cb92e15dd

mirror of https://github.com/infiniflow/ragflow.git synced 2025-12-08 20:42:30 +08:00

Kevin Hu cbf04ee470 Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423)

2025-10-09 12:36:19 +08:00

2.1 KiB

You are an expert parser and data formatter, currently in the process of building a JSON array from a multi-page table of contents (TOC). Your task is to analyze the new page of content and append the new entries to the existing JSON array.

Instructions:

1. You will be given two inputs:
   - current_page_text: The text content from the new page of the TOC.
   - existing_json: The valid JSON array you have generated from the previous pages.
2. Analyze each line of the current_page_text input.
3. For each new line, extract the following three pieces of information:
   - structure: The hierarchical index/numbering (e.g., "1", "2.1", "3.2.5"). Use null if none exists.
   - title: The clean textual title of the section or chapter.
   - page: The page number on which the section starts. Extract only the number. Use null if not present.
4. Append these new entries to the existing_json array. Do not modify, reorder, or delete any of the existing entries.
5. Output only the complete, updated JSON array. Do not include any other text, explanations, or markdown code block fences (like ```json).
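
For TOC lines that follow the regular "numbering, title, dot leaders, page" layout, the three fields described above can be approximated without a model. The following is a minimal sketch under that assumption; `TocEntry`, `parse_toc_line`, and the regular expression are hypothetical names introduced here for illustration and are not part of RAGFlow. The heuristic misreads titles that end in a digit with no page number and headings that begin with a year.

```python
import re
from typing import Optional, TypedDict


class TocEntry(TypedDict):
    structure: Optional[str]
    title: str
    page: Optional[int]


# Optional numbering ("3.2.5"), then the title, then optional
# dot leaders / whitespace followed by a trailing page number.
_TOC_LINE = re.compile(
    r"^\s*(?:(?P<structure>\d+(?:\.\d+)*)\.?\s+)?"
    r"(?P<title>.+?)"
    r"(?:[\s.]+(?P<page>\d+))?\s*$"
)


def parse_toc_line(line: str) -> Optional[TocEntry]:
    """Split one TOC line into structure, title, and page (heuristic)."""
    if not line.strip():
        return None  # blank lines carry no entry
    match = _TOC_LINE.match(line)
    if match is None:
        return None
    page = match.group("page")
    return {
        "structure": match.group("structure"),
        # Drop leftover dot leaders when the line has no page number.
        "title": match.group("title").strip(" ."),
        "page": int(page) if page is not None else None,
    }
```

Irregular layouts (multi-line titles, roman-numeral pages, two-column TOCs) are exactly the cases where a sketch like this breaks down and a model-based extraction is more robust.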

JSON Format: The output must be a valid JSON array following this schema:

[
    {
        "structure": <string or null>,
        "title": <string>,
        "page": <number or null>
    },
    ...
]
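
A caller may want to check a model response against this schema before using it. The sketch below is one possible check, not the validation RAGFlow performs; `validate_toc_json` and `REQUIRED_KEYS` are hypothetical names, and treating `page` as an integer (rather than any JSON number) is an assumption.

```python
import json
from typing import Any

REQUIRED_KEYS = {"structure", "title", "page"}


def validate_toc_json(raw: str) -> list[dict[str, Any]]:
    """Parse raw model output and verify it matches the TOC schema."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, list):
        raise ValueError("top-level value must be a JSON array")
    for i, entry in enumerate(data):
        if not isinstance(entry, dict) or set(entry) != REQUIRED_KEYS:
            raise ValueError(f"entry {i}: expected exactly the keys {sorted(REQUIRED_KEYS)}")
        structure, title, page = entry["structure"], entry["title"], entry["page"]
        if structure is not None and not isinstance(structure, str):
            raise ValueError(f"entry {i}: 'structure' must be a string or null")
        if not isinstance(title, str):
            raise ValueError(f"entry {i}: 'title' must be a string")
        # bool is a subclass of int in Python, so reject it explicitly.
        if page is not None and (isinstance(page, bool) or not isinstance(page, int)):
            raise ValueError(f"entry {i}: 'page' must be an integer or null")
    return data
```

Because `json.loads` rejects anything that is not pure JSON, a response wrapped in markdown code fences fails here as well, which matches the instruction to output the array alone.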

Input Example:

current_page_text:

3.2 Advanced Configuration ........... 25
3.3 Troubleshooting .................. 28
4 User Management .................... 30

existing_json:

[
    {"structure": "1", "title": "Introduction", "page": 1},
    {"structure": "2", "title": "Installation", "page": 5},
    {"structure": "3", "title": "Configuration", "page": 12},
    {"structure": "3.1", "title": "Basic Setup", "page": 15}
]

Expected Output For The Example:

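[
    {"structure": "1", "title": "Introduction", "page": 1},
    {"structure": "2", "title": "Installation", "page": 5},
    {"structure": "3", "title": "Configuration", "page": 12},
    {"structure": "3.1", "title": "Basic Setup", "page": 15},
    {"structure": "3.2", "title": "Advanced Configuration", "page": 25},
    {"structure": "3.3", "title": "Troubleshooting", "page": 28},
    {"structure": "4", "title": "User Management", "page": 30}
]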
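
In practice a model may return either the complete updated array or only the entries from the new page. A caller that wants to tolerate both could merge the response as sketched below; `merge_toc_response` is a hypothetical helper introduced for illustration, not a RAGFlow function.

```python
from typing import Any

Entry = dict[str, Any]


def merge_toc_response(existing: list[Entry], response: list[Entry]) -> list[Entry]:
    """Return existing entries followed by the new ones.

    If the response already starts with the existing entries, it is taken
    as the complete array; otherwise it is treated as new entries only.
    """
    if response[: len(existing)] == existing:
        return list(response)
    return existing + response
```

Either way the existing entries stay first and unmodified, which is the invariant the instructions ask for.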

Now, process the following inputs:

current_page_text: {{ toc_page }}

existing_json: {{ toc_json }}
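
The `{{ toc_page }}` and `{{ toc_json }}` placeholders are filled in before the prompt is sent to the model. The sketch below shows the idea with a small standard-library substitution as a stand-in for a real template engine such as Jinja2; `fill_toc_prompt` and the regular expression are hypothetical and do not reflect how RAGFlow renders its prompts.

```python
import json
import re
from typing import Any

# Matches "{{ name }}" with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_toc_prompt(template: str, toc_page: str, toc_json: list[dict[str, Any]]) -> str:
    """Substitute the two placeholders; unknown placeholders are left as-is."""
    values = {
        "toc_page": toc_page,
        # Serialize the existing entries so the model sees valid JSON.
        "toc_json": json.dumps(toc_json, ensure_ascii=False, indent=4),
    }
    return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


if __name__ == "__main__":
    template = "current_page_text: {{ toc_page }}\n\nexisting_json: {{ toc_json }}"
    existing = [{"structure": "3.1", "title": "Basic Setup", "page": 15}]
    print(fill_toc_prompt(template, "3.2 Advanced Configuration ........... 25", existing))
```

Passing the existing array through `json.dumps` rather than `str()` matters: Python's default representation uses single quotes and `None`, which is not valid JSON.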
