====== llm_index: RAG index builder ======
''llm_index'' builds the search index that grounds the [[module:chat_llm|LLM
Guru]] in your BBS's own content. It reads configured content sources, builds
a BM25 keyword index, and writes it to disk; the chat engine then retrieves
the most relevant chunks each turn and injects them into the model's prompt
(the ''@retrieved_context@'' macro). This is what lets the Guru answer
"what's been posted about X on this board" instead of inventing an answer.
Indexing is **entirely local** — every source is a local file/database read.
No network calls, no API tokens, nothing leaves the host at index time.
The builder lives in ''exec/llm_index.js''; each source is a crawler file
under ''exec/llm_index/''.
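To make the retrieval step concrete, here is a minimal BM25 sketch in plain JavaScript. It illustrates the general technique only and is not the code in ''exec/llm_index.js''. The function names (''tokenize'', ''buildIndex'', ''search''), the tokenizer, and the ''k1''/''b'' constants are all assumptions; the real index may store different fields and apply extra weighting.

```javascript
// Illustrative BM25 sketch; NOT the actual llm_index.js implementation.
// All names here are hypothetical.

// Lowercase the text and split it into alphanumeric terms.
function tokenize(text) {
    return String(text).toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Build per-chunk term frequencies plus document frequencies.
function buildIndex(chunks) {
    var index = { docs: [], df: Object.create(null), avgLen: 0 };
    var totalLen = 0;
    chunks.forEach(function (chunk) {
        var tf = Object.create(null);
        var terms = tokenize(chunk.text);
        terms.forEach(function (t) { tf[t] = (tf[t] || 0) + 1; });
        Object.keys(tf).forEach(function (t) {
            index.df[t] = (index.df[t] || 0) + 1;
        });
        index.docs.push({ id: chunk.id, tf: tf, len: terms.length });
        totalLen += terms.length;
    });
    index.avgLen = chunks.length ? totalLen / chunks.length : 0;
    return index;
}

// Score every chunk against the query and return the topK matches.
function search(index, query, topK) {
    var k1 = 1.2, b = 0.75; // conventional BM25 defaults (assumed)
    var n = index.docs.length;
    var terms = tokenize(query);
    var scored = index.docs.map(function (doc) {
        var score = 0;
        terms.forEach(function (t) {
            var f = doc.tf[t] || 0;
            if (!f) return;
            var df = index.df[t];
            var idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
            var norm = f + k1 * (1 - b + b * doc.len / index.avgLen);
            score += idf * (f * (k1 + 1)) / norm;
        });
        return { id: doc.id, score: score };
    });
    return scored
        .filter(function (r) { return r.score > 0; })
        .sort(function (x, y) { return y.score - x.score; })
        .slice(0, topK);
}
```

Chunks that share no term with the query score zero and are dropped, which is roughly the role a relevance gate plays before anything is injected into the prompt.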
===== Building an index =====
Run it under [[util:jsexec]], passing the persona whose
[[config:chat_llm.ini]] section to read:
  jsexec llm_index.js guru
With no argument it uses the ''default'' section. It reads ''index_sources''
from that section, runs each crawler, builds the BM25 index, and writes it to
''index_output'' (default ''data/chat/.idx'').
The index is a static file — rebuild it to pick up new content. A common
setup is a nightly timed event that re-runs the builder so the Guru's
knowledge stays current. Retrieval tuning (how many chunks, the relevance
gate, source weights) lives in [[config:chat_llm.ini#retrieval_rag|chat_llm.ini]].
===== Bundled crawlers =====
A source in ''index_sources'' names a crawler file under ''exec/llm_index/''.
An optional ''%%:argument%%'' suffix is source-specific.
^ Source ^ Argument ^ Indexes ^
| ''msgbase'' | Comma-separated **group names** to include (omit for all groups). | One chunk per non-deleted, non-private message (subject + body), tagged with sub, author, and date for citation. |
| ''filebase'' | Comma-separated **library names** to include (omit for all). | One chunk per file with its description; skips non-public directories. |
| ''dokuwiki'' | **Path** to the DokuWiki ''data/pages'' directory. | One chunk per wiki page, tagged with its namespace path. |
Both ''msgbase'' and ''filebase'' accept **per-container exclusions**: append a
''%%/-token%%'' suffix to a group or library name, and each token is matched
case-insensitively against a sub/dir code or name. For example, to index the
''Main'' group without its bot-mirror subs, plus all of ''DOVE-Net'':
  index_sources = msgbase:Main/-gitlog/-commits,DOVE-Net
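The exclusion syntax can be sketched as a small parser for one container spec. This is a hedged illustration: ''parseContainer'' and ''isExcluded'' are hypothetical names, and the exact-match rule is an assumption (the bundled crawlers may match differently, for example by substring).

```javascript
// Illustrative only; function names and matching rule are assumptions.

// Split one container spec such as "Main/-gitlog/-commits" into the
// container name and its exclusion tokens (lowercased for comparison).
function parseContainer(spec) {
    var parts = String(spec).split("/-");
    return {
        name: parts[0].trim(),
        exclude: parts.slice(1)
            .map(function (t) { return t.trim().toLowerCase(); })
            .filter(function (t) { return t.length > 0; })
    };
}

// Assumption: a token excludes a sub/dir when it equals the code or the
// name, ignoring case.
function isExcluded(container, code, name) {
    var c = String(code).toLowerCase();
    var n = String(name).toLowerCase();
    return container.exclude.some(function (t) {
        return t === c || t === n;
    });
}
```

A spec with no ''%%/-token%%'' suffix, such as ''DOVE-Net'', yields an empty exclusion list, so nothing in that container is skipped.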
A documentation-grounded persona can combine local posts, files, and a local
wiki tree. Sources are **semicolon**-separated; commas only separate the names
inside a single source's argument:
  index_sources = msgbase:Local,DOVE-Net; filebase; dokuwiki:/var/www/html/wiki/data/pages
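As a rough sketch of how such a value breaks down, the function below splits an ''index_sources'' string into crawler names and arguments. ''parseSources'' is a hypothetical helper, not part of the builder; it assumes whitespace around separators is ignored and that only the first '':'' separates the name from the argument, so that paths containing colons survive.

```javascript
// Illustrative only; parseSources is a hypothetical name.
// Returns [{ name: "msgbase", arg: "Local,DOVE-Net" }, ...];
// arg is null when the source has no ':argument' suffix.
function parseSources(value) {
    return String(value).split(";")
        .map(function (s) { return s.trim(); })
        .filter(function (s) { return s.length > 0; })
        .map(function (s) {
            var colon = s.indexOf(":");
            if (colon < 0) return { name: s, arg: null };
            return {
                name: s.slice(0, colon).trim(),
                arg: s.slice(colon + 1).trim()
            };
        });
}
```

Applied to the example above, this yields three sources: ''msgbase'' with the argument ''Local,DOVE-Net'', ''filebase'' with no argument, and ''dokuwiki'' with the pages path.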
===== Writing a crawler =====
Create ''exec/llm_index/.js'' that defines a ''crawl(opts)'' function
returning an array of chunk objects:
  function crawl(opts) {
      // opts.arg        -- the text after the ':' in the source spec (or null)
      // opts.max_chunks -- soft cap the builder may pass
      var chunks = [];
      // Placeholder value: use the content's own timestamp (epoch seconds).
      var when_epoch_seconds = Math.floor(Date.now() / 1000);
      chunks.push({
          id: "unique-id",          // stable id for this chunk
          text: "the body text to index and retrieve",
          provenance: "From ",      // citation string shown to the model
          title: "short title",     // optional; boosts title-term matches
          ts: when_epoch_seconds    // optional; enables recency weighting
      });
      return chunks;
  }
Name the source in ''index_sources'' and it's picked up automatically — the
chat engine needs no changes. Keep crawlers read-only and local.
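Before adding a new crawler to ''index_sources'', it can help to sanity-check what ''crawl()'' returns. The helper below is a sketch under stated assumptions: ''checkChunks'' is a hypothetical name, and it treats ''id'', ''text'', and ''provenance'' as required non-empty strings only because the template above marks just ''title'' and ''ts'' as optional. The builder itself may be more or less strict.

```javascript
// Illustrative only; checkChunks is a hypothetical helper.
// Returns a list of human-readable problems (empty when all is well).
function checkChunks(chunks) {
    var problems = [];
    var seen = Object.create(null);
    if (!Array.isArray(chunks)) return ["crawl() did not return an array"];
    chunks.forEach(function (c, i) {
        // Assumed-required string fields.
        ["id", "text", "provenance"].forEach(function (field) {
            if (!c || typeof c[field] !== "string" || c[field].length === 0)
                problems.push("chunk " + i + ": missing " + field);
        });
        // Ids should be unique so each chunk stays individually citable.
        if (c && typeof c.id === "string") {
            if (seen[c.id])
                problems.push("chunk " + i + ": duplicate id " + c.id);
            seen[c.id] = true;
        }
        // ts is optional, but must be a finite number when present.
        if (c && c.ts !== undefined &&
            (typeof c.ts !== "number" || !isFinite(c.ts)))
            problems.push("chunk " + i + ": ts is not a number");
    });
    return problems;
}
```

Running ''checkChunks(crawl({ arg: null }))'' during development and printing the result is a cheap way to catch empty bodies or duplicate ids before a full index rebuild.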
===== See Also =====
* [[config:chat_llm.ini]] — ''index_*'' settings and source syntax
* [[module:chat_llm]] — how retrieved chunks are used
* [[module:llm_tools]] — live lookups (complementary to RAG)
{{tag>chat guru llm chat_llm rag bm25 index ai}}