====== llm_index: RAG index builder ======

''llm_index'' builds the search index that grounds the [[module:chat_llm|LLM Guru]] in your BBS's own content. It reads configured content sources, builds a BM25 keyword index, and writes it to disk; the chat engine then retrieves the most relevant chunks each turn and injects them into the model's prompt (the ''@retrieved_context@'' macro). This is what lets the Guru answer "what's been posted about X on this board" instead of inventing an answer.

Indexing is **entirely local**: every source is a local file/database read. No network calls, no API tokens, nothing leaves the host at index time.

The builder lives in ''exec/llm_index.js''; each source is a crawler file under ''exec/llm_index/''.

===== Building an index =====

Run it under [[util:jsexec]], passing the persona whose [[config:chat_llm.ini]] section to read:

  jsexec llm_index.js guru

With no argument it uses the ''default'' section. It reads ''index_sources'' from that section, runs each crawler, builds the BM25 index, and writes it to ''index_output'' (default ''data/chat/.idx'').

The index is a static file; rebuild it to pick up new content. A common setup is a nightly timed event that re-runs the builder so the Guru's knowledge stays current.

Retrieval tuning (how many chunks, the relevance gate, source weights) lives in [[config:chat_llm.ini#retrieval_rag|chat_llm.ini]].

===== Bundled crawlers =====

A source in ''index_sources'' names a crawler file under ''exec/llm_index/''. An optional ''%%:argument%%'' suffix is source-specific.

^ Source ^ Argument ^ Indexes ^
| ''msgbase'' | Comma-separated **group names** to include (omit for all groups). | One chunk per non-deleted, non-private message (subject + body), tagged with sub, author, and date for citation. |
| ''filebase'' | Comma-separated **library names** to include (omit for all). | One chunk per file with its description; skips non-public directories. |
| ''dokuwiki'' | **Path** to the DokuWiki ''data/pages'' directory. | One chunk per wiki page, tagged with its namespace path. |

Both ''msgbase'' and ''filebase'' accept **per-container exclusions** with a ''%%/-token%%'' suffix, matched case-insensitively against a sub/dir code or name. For example, to index the ''Main'' group but skip its bot-mirror subs:

  index_sources = msgbase:Main/-gitlog/-commits,DOVE-Net

And a documentation-grounded persona combining local posts, files, and a local wiki tree (sources are **semicolon**-separated):

  index_sources = msgbase:Local,DOVE-Net; filebase; dokuwiki:/var/www/html/wiki/data/pages

===== Writing a crawler =====

Create ''exec/llm_index/.js'' that defines a ''crawl(opts)'' function returning an array of chunk objects:

  function crawl(opts) {
      // opts.arg        -- the text after the ':' in the source spec (or null)
      // opts.max_chunks -- soft cap the builder may pass
      var chunks = [];
      chunks.push({
          id:         "unique-id",        // stable id for this chunk
          text:       "the body text to index and retrieve",
          provenance: "From ",            // citation string shown to the model
          title:      "short title",      // optional; boosts title-term matches
          ts:         when_epoch_seconds  // optional; enables recency weighting
      });
      return chunks;
  }

Name the source in ''index_sources'' and it's picked up automatically; the chat engine needs no changes. Keep crawlers read-only and local.

===== See Also =====

  * [[config:chat_llm.ini]]: ''index_*'' settings and source syntax
  * [[module:chat_llm]]: how retrieved chunks are used
  * [[module:llm_tools]]: live lookups (complementary to RAG)

{{tag>chat guru llm chat_llm rag bm25 index ai}}
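
===== Example crawler (sketch) =====

To make the ''crawl(opts)'' contract from "Writing a crawler" concrete, here is a minimal sketch of a crawler for a small board FAQ. Everything specific to it is an assumption for illustration: the file name ''faq.js'', the ''FAQ_ENTRIES'' data, the ''faq:'' id prefix, and the choice to treat ''opts.arg'' as a topic filter are hypothetical and not part of the bundled crawlers. The entries are hard-coded only to keep the example self-contained; a real crawler would read them from a local file or database. It sticks to ''var'' and plain ''function'' syntax so it does not depend on newer language features.

```javascript
// Hypothetical crawler sketch, e.g. exec/llm_index/faq.js (name is an assumption).
// The data is hard-coded for illustration; a real crawler would read it from
// a local file or database.
var FAQ_ENTRIES = [
    { slug: "new-user", topic: "accounts", question: "How do I register?",
      answer: "Log on as 'new' and follow the prompts.", updated: 1700000000 },
    { slug: "qwk", topic: "mail", question: "Does this board support QWK?",
      answer: "Yes, from the mail menu.", updated: 1700086400 },
    { slug: "fidonet", topic: "mail", question: "Is FidoNet netmail available?",
      answer: "Yes, to validated users.", updated: 1700172800 }
];

function crawl(opts) {
    opts = opts || {};
    // opts.arg: text after ':' in the source spec. This sketch interprets it
    // as an optional topic filter, compared case-insensitively.
    var topic = opts.arg ? String(opts.arg).toLowerCase() : null;
    // opts.max_chunks: soft cap; a missing or non-positive value means no cap.
    var cap = opts.max_chunks > 0 ? opts.max_chunks : Infinity;
    var chunks = [];
    for (var i = 0; i < FAQ_ENTRIES.length && chunks.length < cap; i++) {
        var e = FAQ_ENTRIES[i];
        if (topic && e.topic !== topic)
            continue; // filtered out by the source argument
        chunks.push({
            id: "faq:" + e.slug,                  // stable across rebuilds
            text: e.question + "\n" + e.answer,   // body to index and retrieve
            provenance: "From the board FAQ (" + e.topic + ")",
            title: e.question,                    // optional
            ts: e.updated                         // optional, epoch seconds
        });
    }
    return chunks;
}
```

Assuming the file were saved under ''exec/llm_index/'' with that name, a source spec such as ''faq:mail'' in ''index_sources'' would pass ''"mail"'' as ''opts.arg'' and yield only the two mail-related chunks. Deriving ''id'' from a stable slug rather than an array position keeps ids unchanged between rebuilds when entries are added or reordered.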