llm_index: RAG index builder

llm_index builds the search index that grounds the LLM Guru in your BBS's own content. It reads configured content sources, builds a BM25 keyword index, and writes it to disk; the chat engine then retrieves the most relevant chunks each turn and injects them into the model's prompt (the @retrieved_context@ macro). This is what lets the Guru answer “what's been posted about X on this board” instead of inventing an answer.

Indexing is entirely local: every source is read from a local file or database. The builder makes no network calls and needs no API tokens, and nothing leaves the host at index time.

The builder lives in exec/llm_index.js; each source is a crawler file under exec/llm_index/.

Building an index

Run it under jsexec, passing the persona whose chat_llm.ini section to read:

jsexec llm_index.js guru

With no argument it uses the default section. It reads index_sources from that section, runs each crawler, builds the BM25 index, and writes it to index_output (default data/chat/<persona>.idx).

The index is a static file — rebuild it to pick up new content. A common setup is a nightly timed event that re-runs the builder so the Guru's knowledge stays current. Retrieval tuning (how many chunks, the relevance gate, source weights) lives in chat_llm.ini.
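To make the retrieval step concrete, here is a minimal BM25 sketch in plain JavaScript. It is not the code in exec/llm_index.js. The function names (tokenize, build_index, search), the parameters top_k and min_score, and the constants k1 = 1.2 and b = 0.75 are illustrative assumptions, and the real on-disk index format is not shown here.

```javascript
// Illustrative sketch only: names and constants are assumptions, not llm_index's API.

// Lowercase and split on anything that is not a letter or digit.
function tokenize(s) {
    return s.toLowerCase().split(/[^a-z0-9]+/).filter(function (t) {
        return t.length > 0;
    });
}

// Build per-chunk term frequencies, document frequencies, and the average length.
function build_index(chunks) {
    var df = Object.create(null);
    var docs = [];
    var total = 0;
    chunks.forEach(function (c) {
        var tf = Object.create(null);
        var toks = tokenize(c.text);
        toks.forEach(function (t) { tf[t] = (tf[t] || 0) + 1; });
        Object.keys(tf).forEach(function (t) { df[t] = (df[t] || 0) + 1; });
        docs.push({ id: c.id, tf: tf, len: toks.length });
        total += toks.length;
    });
    return { docs: docs, df: df, avg_len: docs.length ? total / docs.length : 0 };
}

// Score every chunk against the query and keep the best top_k at or above min_score.
function search(index, query, top_k, min_score) {
    var k1 = 1.2, b = 0.75;              // common BM25 defaults (assumed)
    var n = index.docs.length;
    var terms = tokenize(query);
    var hits = [];
    index.docs.forEach(function (d) {
        var score = 0;
        terms.forEach(function (t) {
            var f = d.tf[t];
            if (!f) return;
            var idf = Math.log(1 + (n - index.df[t] + 0.5) / (index.df[t] + 0.5));
            var norm = f + k1 * (1 - b + b * d.len / index.avg_len);
            score += idf * f * (k1 + 1) / norm;
        });
        if (score > 0 && score >= min_score) hits.push({ id: d.id, score: score });
    });
    hits.sort(function (x, y) { return y.score - x.score; });
    return hits.slice(0, top_k);
}
```

In this sketch, top_k stands in for "how many chunks" and min_score for the relevance gate. The actual setting names in chat_llm.ini may differ.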

Bundled crawlers

Each source in index_sources names a crawler file under exec/llm_index/. An optional :argument suffix is passed to the crawler, and its meaning depends on the source.

Source	Argument	Indexes
`msgbase`	Comma-separated group names to include (omit for all groups).	One chunk per non-deleted, non-private message (subject + body), tagged with sub, author, and date for citation.
`filebase`	Comma-separated library names to include (omit for all).	One chunk per file with its description; skips non-public directories.
`dokuwiki`	Path to the DokuWiki `data/pages` directory.	One chunk per wiki page, tagged with its namespace path.

Both msgbase and filebase accept per-container exclusions: append one or more /-token suffixes to a group or library name, where each token is matched case-insensitively against a sub or directory code or name. For example, to index the Main group without its bot-mirror subs, plus the DOVE-Net group:

index_sources = msgbase:Main/-gitlog/-commits,DOVE-Net

A documentation-grounded persona can combine local posts, files, and a local wiki tree (sources are separated by semicolons):

index_sources = msgbase:Local,DOVE-Net; filebase; dokuwiki:/var/www/html/wiki/data/pages
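The syntax above can be read as three nested levels: semicolons separate sources, the first colon separates the crawler name from its argument, and, for msgbase and filebase, commas separate containers that may each carry /-token exclusions. The sketch below parses a value that way. It is not the builder's actual parser: the function names parse_sources and parse_container_arg are hypothetical, and it assumes each exclusion applies only to the group or library it is attached to.

```javascript
// Illustrative sketch only: not the parser used by exec/llm_index.js.

// Split an index_sources value into { name, arg } entries.
function parse_sources(value) {
    return value.split(";")
        .map(function (s) { return s.trim(); })
        .filter(function (s) { return s.length > 0; })
        .map(function (spec) {
            var i = spec.indexOf(":");   // only the first ':' separates name from argument
            return {
                name: (i < 0 ? spec : spec.slice(0, i)).trim(),
                arg: i < 0 ? null : spec.slice(i + 1).trim()
            };
        });
}

// Split a msgbase/filebase argument into containers with their exclusions.
// Assumption: "/-token" suffixes belong to the name they follow.
function parse_container_arg(arg) {
    if (!arg) return [];                 // no argument: include everything
    return arg.split(",").map(function (item) {
        var parts = item.trim().split("/-");
        return {
            include: parts[0].trim(),
            exclude: parts.slice(1).map(function (t) {
                return t.trim().toLowerCase();   // matching is case-insensitive
            })
        };
    }).filter(function (g) { return g.include.length > 0; });
}
```

For example, parse_container_arg("Main/-gitlog/-commits,DOVE-Net") yields Main with the exclusions gitlog and commits, and DOVE-Net with none.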

Writing a crawler

Create exec/llm_index/<name>.js that defines a crawl(opts) function returning an array of chunk objects:

function crawl(opts) {
    // opts.arg        -- the text after the ':' in the source spec (or null)
    // opts.max_chunks -- soft cap the builder may pass
    // Placeholder timestamp; a real crawler would use each item's own date.
    var when_epoch_seconds = Math.floor(Date.now() / 1000);
    var chunks = [];
    chunks.push({
        id:         "unique-id",        // stable id for this chunk
        text:       "the body text to index and retrieve",
        provenance: "From <where>",     // citation string shown to the model
        title:      "short title",      // optional; boosts title-term matches
        ts:         when_epoch_seconds  // optional; enables recency weighting
    });
    return chunks;
}

Name the source in index_sources and it's picked up automatically — the chat engine needs no changes. Keep crawlers read-only and local.
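As a slightly fuller illustration of the crawl(opts) contract, the sketch below splits documents into paragraph-sized chunks and respects opts.max_chunks. The crawler name "notes", the helpers load_docs and chunks_from_doc, and the sample records are all hypothetical. load_docs is a stand-in that returns fixed data, where a real crawler would read local files or a database.

```javascript
// Hypothetical crawler sketch, e.g. exec/llm_index/notes.js (the name is made up).

// Placeholder loader: returns fixed sample records so the sketch is self-contained.
// A real crawler would read local content here, possibly using opts.arg as a path.
function load_docs(arg) {
    return [
        { name: "rules.txt", ts: 1700000000,
          text: "Be polite to other callers.\n\nNo commercial advertising in local subs." },
        { name: "notes.txt", ts: 1700086400,
          text: "Sample text for a second file." }
    ];
}

// Split one document on blank lines into chunks in the shape crawl() must return.
function chunks_from_doc(doc) {
    var out = [];
    var paras = doc.text.split(/\n\s*\n/);
    for (var i = 0; i < paras.length; i++) {
        var body = paras[i].replace(/\s+/g, " ").trim();
        if (!body) continue;                          // skip empty paragraphs
        out.push({
            id:         "notes:" + doc.name + "#" + out.length,
            text:       body,
            provenance: "From notes file " + doc.name,
            title:      doc.name,
            ts:         doc.ts
        });
    }
    return out;
}

function crawl(opts) {
    // Treat a missing or non-positive max_chunks as "no cap" (an assumption).
    var cap = opts && opts.max_chunks > 0 ? opts.max_chunks : Infinity;
    var docs = load_docs(opts ? opts.arg : null);
    var chunks = [];
    for (var d = 0; d < docs.length && chunks.length < cap; d++) {
        var part = chunks_from_doc(docs[d]);
        for (var i = 0; i < part.length && chunks.length < cap; i++)
            chunks.push(part[i]);
    }
    return chunks;
}
```

Deriving each id from the file name and paragraph position keeps ids stable across rebuilds as long as the content does not change.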
