Supported file formats are DOCX, PDF, TXT.
Since a book is long and not all the parts are useful, if it's a PDF,
- please setup the page ranges for every book in order eliminate negative effects and save computing time for analyzing.`,
+ please set up the page ranges for every book in order to eliminate negative effects and save computing time during analysis.
`,
},
laws: {
title: '',
- description: `Supported file formats are docx, pdf, txt.`,
+ description: `Supported file formats are DOCX, PDF, TXT.
+ Legal documents follow a very rigorous writing format. We use text features to detect split points.
+
+ The chunk granularity is consistent with 'ARTICLE', and all the upper-level text will be included in the chunk.
+
`,
},
- manual: { title: '', description: `Only pdf is supported.` },
- media: { title: '', description: '' },
+ manual: { title: '', description: `Only PDF is supported.
+ We assume the manual has a hierarchical section structure and use the lowest section titles as pivots to slice the document.
+ So, the figures and tables in the same section will not be sliced apart, and the chunk size might be large.
+
` },
naive: {
title: '',
- description: `Supported file formats are docx, pdf, txt.
- This method apply the naive ways to chunk files.
- Successive text will be sliced into pieces using 'delimiter'.
- Next, these successive pieces are merge into chunks whose token number is no more than 'Max token number'.`,
+ description: `Supported file formats are DOCX, EXCEL, PPT, IMAGE, PDF, TXT.
+ This method applies naive ways to chunk files:
+
+ Successive text will be sliced into pieces using a vision detection model.
+ Next, these successive pieces are merged into chunks whose token number is no more than 'Token number'.`,
},
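The slice-and-merge idea behind the naive method could be sketched as follows. This is a hypothetical simplification, not the actual RAGFlow implementation: it uses a plain text delimiter in place of the vision detection model, and a crude word count in place of a real tokenizer.

```typescript
// Very rough token estimate: whitespace-separated words (assumption for the sketch).
function countTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

// Slice text on a delimiter, then greedily merge pieces into chunks
// whose approximate token count stays at or under maxTokens.
function naiveChunk(text: string, delimiter: string, maxTokens: number): string[] {
  const pieces = text.split(delimiter).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  let current = '';
  for (const piece of pieces) {
    const candidate = current ? current + delimiter + piece : piece;
    if (current && countTokens(candidate) > maxTokens) {
      // Adding this piece would overflow the budget: close the current chunk.
      chunks.push(current);
      current = piece;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Note that a single piece larger than the budget still becomes its own chunk, which matches the observation above that chunk sizes can grow past the target.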
paper: {
title: '',
- description: `Only pdf is supported.
- The special part is that, the abstract of the paper will be sliced as an entire chunk, and will not be sliced partly.`,
+ description: `Only PDF is supported.
+ If our model works well, the paper will be sliced by its sections, such as abstract, 1.1, 1.2, etc.
+ The benefit of doing this is that the LLM can better summarize the content of relevant sections in the paper,
+ resulting in more comprehensive answers that help readers better understand it.
+ The downside is that it increases the context of the LLM conversation and adds computational cost,
+ so during the conversation, you can consider reducing the 'topN' setting.
`,
},
presentation: {
title: '',
- description: `The supported file formats are pdf, pptx.
- Every page will be treated as a chunk. And the thumbnail of every page will be stored.
- PPT file will be parsed by using this method automatically, setting-up for every PPT file is not necessary.`,
+ description: `The supported file formats are PDF, PPTX.
+ Every page will be treated as a chunk, and the thumbnail of every page will be stored.
+ All the PPT files you upload will be chunked by this method automatically; setting it up for every PPT file is not necessary.
`,
},
qa: {
title: '',
- description: `Excel and csv(txt) format files are supported.
- If the file is in excel format, there should be 2 column question and answer without header.
+ description: `EXCEL and CSV/TXT files are supported.
+ If the file is in Excel format, it should contain two columns, question and answer, without headers.
And question column is ahead of answer column.
- And it's O.K if it has multiple sheets as long as the columns are rightly composed.
+ And it's OK to have multiple sheets as long as the columns are correctly composed.
- If it's in csv format, it should be UTF-8 encoded. Use TAB as delimiter to separate question and answer.
+ If it's in CSV format, it should be UTF-8 encoded, with TAB as the delimiter between question and answer.
- All the deformed lines will be ignored.
- Every pair of Q&A will be treated as a chunk.`,
+ All malformed lines will be ignored.
+ Every pair of Q&A will be treated as a chunk.
`,
},
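The Q&A parsing rule for TAB-delimited text files could be sketched like this. The function name and shapes are hypothetical, not the RAGFlow parser; it only illustrates the rule stated above that each well-formed line becomes one chunk and malformed lines are skipped.

```typescript
interface QaChunk {
  question: string;
  answer: string;
}

// Each TAB-delimited line yields one (question, answer) chunk;
// lines without exactly two non-empty columns are ignored.
function parseQaLines(content: string): QaChunk[] {
  const chunks: QaChunk[] = [];
  for (const line of content.split('\n')) {
    const cols = line.split('\t');
    if (cols.length !== 2 || !cols[0].trim() || !cols[1].trim()) continue;
    chunks.push({ question: cols[0].trim(), answer: cols[1].trim() });
  }
  return chunks;
}
```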
resume: {
title: '',
- description: `The supported file formats are pdf, docx and txt.`,
+ description: `The supported file formats are DOCX, PDF, TXT.
+
+ Résumés come in a variety of formats, just like personalities, but we often have to organize them into structured data that is easy to search.
+
+ Instead of chunking the résumé, we parse it into structured data. As an HR, you can dump in all the résumés you have,
+ then list all the candidates that match the qualifications just by talking with 'RagFlow'.
+
+ `,
},
table: {
title: '',
- description: `Excel and csv(txt) format files are supported.
- For csv or txt file, the delimiter between columns is TAB.
- The first line must be column headers.
- Column headers must be meaningful terms inorder to make our NLP model understanding.
- It's good to enumerate some synonyms using slash '/' to separate, and even better to
- enumerate values using brackets like 'gender/sex(male, female)'.
- Here are some examples for headers:
- 1. supplier/vendor\tcolor(yellow, red, brown)\tgender/sex(male, female)\tsize(M,L,XL,XXL)
- 2. 姓名/名字\t电话/手机/微信\t最高学历(高中,职高,硕士,本科,博士,初中,中技,中专,专科,专升本,MPA,MBA,EMBA)
- Every row in table will be treated as a chunk.
-
-visual:
- Image files are supported. Video is comming soon.
- If the picture has text in it, OCR is applied to extract the text as a description of it.
- If the text extracted by OCR is not enough, visual LLM is used to get the descriptions.`,
+ description: `EXCEL and CSV/TXT format files are supported.
+ Here are some tips:
+
`,
+},
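The header convention shown in the earlier examples, synonyms separated by '/' with optional enumerated values in brackets like 'gender/sex(male, female)', could be parsed with a sketch like this. The function and types are hypothetical illustrations, not the RAGFlow parser.

```typescript
interface ColumnHeader {
  synonyms: string[]; // e.g. ['gender', 'sex']
  values: string[];   // e.g. ['male', 'female']; empty if no bracket part
}

// Split a header into its '/'-separated synonyms and the optional
// comma-separated value enumeration inside trailing brackets.
function parseHeader(header: string): ColumnHeader {
  const match = header.match(/^(.*?)(?:\(([^)]*)\))?$/);
  const namePart = match?.[1] ?? header;
  const valuePart = match?.[2] ?? '';
  return {
    synonyms: namePart.split('/').map((s) => s.trim()).filter(Boolean),
    values: valuePart.split(',').map((v) => v.trim()).filter(Boolean),
  };
}
```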
+picture: {
+ title: '',
+ description: `
+ Image files are supported. Video is coming soon.
+ If the picture has text in it, OCR is applied to extract the text as its text description.
+
+ If the text extracted by OCR is not enough, visual LLM is used to get the descriptions.
+
`,
+ },
+one: {
+ title: '',
+ description: `
+ Supported file formats are DOCX, EXCEL, PDF, TXT.
+
+ Each document will be treated as one entire chunk, with no splitting at all.
+
+ If you don't trust any chunking method and the selected LLM's context length covers the document length, you can try this method.
+
`,
},
};
diff --git a/web/src/pages/add-knowledge/components/knowledge-testing/testing-control/index.tsx b/web/src/pages/add-knowledge/components/knowledge-testing/testing-control/index.tsx
index 81d9a9562..138614a96 100644
--- a/web/src/pages/add-knowledge/components/knowledge-testing/testing-control/index.tsx
+++ b/web/src/pages/add-knowledge/components/knowledge-testing/testing-control/index.tsx
@@ -53,9 +53,10 @@ const TestingControl = ({ form, handleTesting }: IProps) => {
>
- label="Top k"
+ label="Top K"
name={'top_k'}
- tooltip="coming soon"
+ tooltip="To limit the computation cost, not all the retrieved chunks will have their vector cosine similarity with the query computed.
+ The bigger 'Top K' is, the higher the recall rate and the slower the retrieval speed."
>
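The Top K behaviour the tooltip describes, keeping only the K best candidates for the cosine-similarity comparison with the query, could be sketched as follows. Function names and the scoring flow are hypothetical, not the RAGFlow retrieval code.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  // Guard against zero-norm vectors.
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Return the indices of the k candidates most similar to the query.
function topK(query: number[], candidates: number[][], k: number): number[] {
  return candidates
    .map((vec, i) => ({ i, score: cosine(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.i);
}
```

A larger `k` keeps more candidates (higher recall) at the cost of more similarity computations (slower retrieval), which is exactly the trade-off the tooltip states.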
diff --git a/web/src/pages/chat/chat-configuration-modal/assistant-setting.tsx b/web/src/pages/chat/chat-configuration-modal/assistant-setting.tsx
index 15b670e80..bca69b661 100644
--- a/web/src/pages/chat/chat-configuration-modal/assistant-setting.tsx
+++ b/web/src/pages/chat/chat-configuration-modal/assistant-setting.tsx
@@ -55,6 +55,7 @@ const AssistantSetting = ({ show }: ISegmentedContentProps) => {
label="Language"
initialValue={'Chinese'}
tooltip="coming soon"
+ style={{display:'none'}}
>