llm_index: RAG index builder
llm_index builds the search index that grounds the LLM
Guru in your BBS's own content. It reads configured content sources, builds
a BM25 keyword index, and writes it to disk; the chat engine then retrieves
the most relevant chunks each turn and injects them into the model's prompt
(the @retrieved_context@ macro). This is what lets the Guru answer
“what's been posted about X on this board” instead of inventing an answer.
Indexing is entirely local — every source is a local file/database read. No network calls, no API tokens, nothing leaves the host at index time.
The builder lives in exec/llm_index.js; each source is a crawler file
under exec/llm_index/.
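To make the "BM25 keyword index" idea concrete, here is a minimal sketch of BM25 indexing and scoring over chunk objects. It is an illustration only, not the code in exec/llm_index.js: the function names (`tokenize`, `build_index`, `search`), the in-memory index layout, the tokenizer, and the constants `K1` and `B` (set to commonly used defaults) are all assumptions, and the IDF term uses the common non-negative variant.

```javascript
// Minimal BM25 sketch. Illustration only; not the builder's actual code.
// K1 and B are conventional defaults, assumed here.
var K1 = 1.2, B = 0.75;

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter(function (t) {
        return t.length > 0;
    });
}

// Collect per-chunk term frequencies plus the corpus statistics BM25 needs.
function build_index(chunks) {
    var index = { docs: [], df: Object.create(null), avg_len: 0 };
    var total = 0;
    chunks.forEach(function (chunk) {
        var terms = tokenize(chunk.text);
        var tf = Object.create(null);
        terms.forEach(function (t) { tf[t] = (tf[t] || 0) + 1; });
        Object.keys(tf).forEach(function (t) {
            index.df[t] = (index.df[t] || 0) + 1;   // document frequency
        });
        index.docs.push({ id: chunk.id, tf: tf, len: terms.length });
        total += terms.length;
    });
    index.avg_len = chunks.length ? total / chunks.length : 0;
    return index;
}

// Score every chunk against the query and return the matches, best first.
function search(index, query) {
    var n = index.docs.length;
    var terms = tokenize(query);
    var results = [];
    index.docs.forEach(function (doc) {
        var score = 0;
        terms.forEach(function (t) {
            var f = doc.tf[t];
            if (!f) return;                          // term absent from chunk
            var df = index.df[t];
            var idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
            var norm = f + K1 * (1 - B + B * doc.len / index.avg_len);
            score += idf * f * (K1 + 1) / norm;
        });
        if (score > 0) results.push({ id: doc.id, score: score });
    });
    results.sort(function (a, b) { return b.score - a.score; });
    return results;
}
```

A chunk that repeats a query term scores higher, and length normalization keeps long chunks from winning on size alone. The sketch ignores the optional `title` and `ts` fields.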
Building an index
Run it under jsexec, passing the persona whose chat_llm.ini section to read:
jsexec llm_index.js guru
With no argument it uses the default section. It reads index_sources
from that section, runs each crawler, builds the BM25 index, and writes it to
index_output (default data/chat/<persona>.idx).
The index is a static file — rebuild it to pick up new content. A common setup is a nightly timed event that re-runs the builder so the Guru's knowledge stays current. Retrieval tuning (how many chunks, the relevance gate, source weights) lives in chat_llm.ini.
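The three retrieval knobs mentioned above (how many chunks, the relevance gate, source weights) can be pictured as a small selection step applied to scored hits. The sketch below is hypothetical: the function name `select_chunks`, the option names `top_k`, `min_score`, and `source_weights`, and the `source` field on each hit are assumptions for illustration, not the actual chat_llm.ini keys or the chat engine's code.

```javascript
// Hypothetical retrieval-time selection. Option names are illustrative
// and are NOT the real chat_llm.ini settings.
function select_chunks(scored, opts) {
    var weights = opts.source_weights || {};
    // Apply a per-source multiplier (default 1) to each hit's score.
    var weighted = scored.map(function (hit) {
        var w = Object.prototype.hasOwnProperty.call(weights, hit.source)
            ? weights[hit.source] : 1;
        return { id: hit.id, source: hit.source, score: hit.score * w };
    });
    // Relevance gate: drop hits that score below the threshold.
    var kept = weighted.filter(function (hit) {
        return hit.score >= opts.min_score;
    });
    // Keep only the best top_k hits.
    kept.sort(function (a, b) { return b.score - a.score; });
    return kept.slice(0, opts.top_k);
}
```

With a gate in place, a turn with no sufficiently relevant chunks would inject nothing rather than weak matches.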
Bundled crawlers
A source in index_sources names a crawler file under exec/llm_index/.
An optional :argument suffix is source-specific.
| Source | Argument | Indexes |
|---|---|---|
| msgbase | Comma-separated group names to include (omit for all groups). | One chunk per non-deleted, non-private message (subject + body), tagged with sub, author, and date for citation. |
| filebase | Comma-separated library names to include (omit for all). | One chunk per file with its description; skips non-public directories. |
| dokuwiki | Path to the DokuWiki data/pages directory. | One chunk per wiki page, tagged with its namespace path. |
Both msgbase and filebase accept per-container exclusions with a
/-token suffix, matched case-insensitively against a sub/dir code or
name. For example, to index the Main group but skip its bot-mirror subs:
index_sources = msgbase:Main/-gitlog/-commits,DOVE-Net
And a documentation-grounded persona combining local posts, files, and a local wiki tree (sources are semicolon-separated):
index_sources = msgbase:Local,DOVE-Net; filebase; dokuwiki:/var/www/html/wiki/data/pages
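The source syntax described above (semicolon-separated sources, an optional `:argument`, comma-separated groups, and `/-token` exclusions) can be sketched as a small parser. This is an illustration, not the builder's parser: the function name `parse_index_sources` and the returned object shape are hypothetical, and it assumes each exclusion applies only to the group it is attached to.

```javascript
// Sketch of the index_sources syntax. The function name and the returned
// shape are hypothetical; the real parser may differ.
function parse_index_sources(value) {
    return value.split(";")
        .map(function (s) { return s.trim(); })
        .filter(function (s) { return s.length > 0; })
        .map(function (spec) {
            // Split on the FIRST colon only, so paths keep any later colons.
            var colon = spec.indexOf(":");
            var name = colon < 0 ? spec : spec.slice(0, colon).trim();
            var arg = colon < 0 ? null : spec.slice(colon + 1).trim();
            var source = { crawler: name, arg: arg, groups: [] };
            // Only msgbase and filebase take group lists with /-token
            // exclusions; other crawlers get the argument untouched.
            if (arg !== null && (name === "msgbase" || name === "filebase")) {
                source.groups = arg.split(",").map(function (item) {
                    var parts = item.trim().split("/-");
                    return {
                        name: parts[0],
                        // Lowercased because matching is case-insensitive.
                        exclude: parts.slice(1).map(function (t) {
                            return t.toLowerCase();
                        })
                    };
                });
            }
            return source;
        });
}
```

Splitting on the two-character sequence `/-` rather than on `/` alone leaves a dokuwiki path such as /var/www/html/wiki/data/pages intact, and a hyphenated group name such as DOVE-Net is not mistaken for an exclusion.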
Writing a crawler
Create exec/llm_index/<name>.js that defines a crawl(opts) function
returning an array of chunk objects:
    function crawl(opts) {
        // opts.arg        -- the text after the ':' in the source spec (or null)
        // opts.max_chunks -- soft cap the builder may pass
        var chunks = [];
        chunks.push({
            id: "unique-id",                             // stable id for this chunk
            text: "the body text to index and retrieve",
            provenance: "From <where>",                  // citation string shown to the model
            title: "short title",                        // optional; boosts title-term matches
            ts: when_epoch_seconds                       // optional; enables recency weighting
        });
        return chunks;
    }
Name the source in index_sources and it's picked up automatically — the
chat engine needs no changes. Keep crawlers read-only and local.
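As a fuller illustration of the contract above, here is a sketch of a hypothetical crawler, imagined as exec/llm_index/bulletins.js. Everything specific to it is invented for the example: the `BULLETINS` array stands in for whatever local file or database a real crawler would read, the use of `opts.arg` as a category filter is one possible meaning for a source-specific argument, and treating a missing or zero `opts.max_chunks` as "no cap" is an assumption.

```javascript
// Hypothetical crawler sketch. BULLETINS is stand-in data; a real crawler
// would read a local file or database here instead.
var BULLETINS = [
    { slug: "rules", category: "policy", title: "House rules",
      body: "Be kind. No spam.", posted: 1700000000 },
    { slug: "doors", category: "games", title: "Door game schedule",
      body: "Tournament every Friday.", posted: 1700600000 },
    { slug: "backup", category: "policy", title: "Backup window",
      body: "Offline 03:00 to 03:15 daily.", posted: 1701000000 }
];

function crawl(opts) {
    // Assumed usage: "bulletins:policy" limits the crawl to one category.
    var wanted = opts.arg ? opts.arg.toLowerCase() : null;
    var chunks = [];
    for (var i = 0; i < BULLETINS.length; i++) {
        var b = BULLETINS[i];
        if (wanted && b.category !== wanted) continue;
        // Assumption: a missing or zero max_chunks means "no cap".
        if (opts.max_chunks && chunks.length >= opts.max_chunks) break;
        chunks.push({
            id: "bulletin:" + b.slug,              // stable across rebuilds
            text: b.title + "\n" + b.body,         // what gets indexed
            provenance: "From bulletin \"" + b.title + "\"",
            title: b.title,
            ts: b.posted                           // epoch seconds
        });
    }
    return chunks;
}
```

Deriving `id` from a stable key (here the slug) rather than from array position keeps ids the same from one rebuild to the next.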
See Also
- chat_llm.ini: index_* settings and source syntax
- chat_llm: how retrieved chunks are used
- llm_tools: live lookups (complementary to RAG)