====== llm_index: RAG index builder ======

''llm_index'' builds the search index that grounds the [[module:chat_llm|LLM
Guru]] in your BBS's own content.  It reads configured content sources, builds
a BM25 keyword index, and writes it to disk; the chat engine then retrieves
the most relevant chunks each turn and injects them into the model's prompt
(the ''@retrieved_context@'' macro).  This is what lets the Guru answer
"what's been posted about X on this board" instead of inventing an answer.

Indexing is **entirely local** — every source is a local file/database read.
No network calls, no API tokens, nothing leaves the host at index time.

The builder lives in ''exec/llm_index.js''; each source is a crawler file
under ''exec/llm_index/''.

===== Building an index =====

Run it under [[util:jsexec]], passing the persona whose
[[config:chat_llm.ini]] section to read:

<code>
jsexec llm_index.js guru
</code>

With no argument it uses the ''default'' section.  It reads ''index_sources''
from that section, runs each crawler, builds the BM25 index, and writes it to
''index_output'' (default ''data/chat/<persona>.idx'').
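
As a rough illustration of what building a BM25 keyword index involves, the sketch below gathers term statistics over chunk text and scores a query against them.  It is textbook BM25, not the builder's actual code: the tokenizer, the ''K1'' and ''B'' values, and the names ''buildIndex'' and ''search'' are all assumptions made for this example, and the real on-disk index format is not shown.

<code javascript>
// Textbook BM25 over an array of { id, text } chunks.
// K1 and B are conventional defaults assumed here, not values from llm_index.
var K1 = 1.2, B = 0.75;

// Assumed tokenizer: lowercase, split on anything that is not a-z or 0-9.
function tokenize(text) {
    return String(text).toLowerCase().split(/[^a-z0-9]+/).filter(function (t) {
        return t.length > 0;
    });
}

// Collect per-chunk term frequencies, document frequencies, and average length.
function buildIndex(chunks) {
    var index = { docs: [], df: Object.create(null), avgLen: 0 };
    var total = 0;
    chunks.forEach(function (chunk) {
        var terms = tokenize(chunk.text);
        var tf = Object.create(null);
        terms.forEach(function (t) { tf[t] = (tf[t] || 0) + 1; });
        Object.keys(tf).forEach(function (t) {
            index.df[t] = (index.df[t] || 0) + 1;
        });
        index.docs.push({ id: chunk.id, tf: tf, len: terms.length });
        total += terms.length;
    });
    index.avgLen = chunks.length ? total / chunks.length : 0;
    return index;
}

// Score every chunk against the query and return the topK best matches.
function search(index, query, topK) {
    var n = index.docs.length;
    var terms = tokenize(query);
    var scored = index.docs.map(function (doc) {
        var score = 0;
        terms.forEach(function (t) {
            var f = doc.tf[t];
            if (!f) return;
            var df = index.df[t];
            var idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
            var norm = 1 - B + B * (doc.len / index.avgLen);
            score += idf * (f * (K1 + 1)) / (f + K1 * norm);
        });
        return { id: doc.id, score: score };
    });
    return scored.filter(function (r) { return r.score > 0; })
                 .sort(function (a, b) { return b.score - a.score; })
                 .slice(0, topK);
}

// Usage: index three made-up chunks, then ask for the two best matches.
var demoIndex = buildIndex([
    { id: "a", text: "door games and door kits" },
    { id: "b", text: "fidonet echomail routing" },
    { id: "c", text: "door" }
]);
var demoHits = search(demoIndex, "door games", 2);
</code>

Length normalization (the ''B'' term) is why a short chunk that mentions a term once can outrank a longer chunk that mentions it twice.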

The index is a static file — rebuild it to pick up new content.  A common
setup is a nightly timed event that re-runs the builder so the Guru's
knowledge stays current.  Retrieval tuning (how many chunks, the relevance
gate, source weights) lives in [[config:chat_llm.ini#retrieval_rag|chat_llm.ini]].

===== Bundled crawlers =====

Each entry in ''index_sources'' names a crawler file under ''exec/llm_index/''.
An optional ''%%:argument%%'' suffix is passed to that crawler; its meaning
depends on the source.

^ Source ^ Argument ^ Indexes ^
| ''msgbase'' | Comma-separated **group names** to include (omit for all groups). | One chunk per non-deleted, non-private message (subject + body), tagged with sub, author, and date for citation. |
| ''filebase'' | Comma-separated **library names** to include (omit for all). | One chunk per file with its description; skips non-public directories. |
| ''dokuwiki'' | **Path** to the DokuWiki ''data/pages'' directory. | One chunk per wiki page, tagged with its namespace path. |

Both ''msgbase'' and ''filebase'' accept **per-container exclusions** with a
''%%/-token%%'' suffix, matched case-insensitively against a sub-board or
directory code or name.  For example, to index the ''Main'' group minus its
bot-mirror subs, along with the ''DOVE-Net'' group:

<code ini>
index_sources = msgbase:Main/-gitlog/-commits,DOVE-Net
</code>

And a documentation-grounded persona combining local posts, files, and a local
wiki tree (sources are **semicolon**-separated):

<code ini>
index_sources = msgbase:Local,DOVE-Net; filebase; dokuwiki:/var/www/html/wiki/data/pages
</code>
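
To make the source syntax concrete, here is a minimal sketch of how such a spec could be taken apart.  It is not the builder's own parser: ''parseSources'' and ''parseContainers'' are hypothetical names, and the sketch assumes that each ''%%/-token%%'' exclusion applies to the name it follows.

<code javascript>
// Split "a:arg; b; c:arg" into [{ name, arg }], with arg null when absent.
// Only the first ':' separates name from argument, so paths survive intact.
function parseSources(spec) {
    return String(spec).split(";").map(function (s) {
        return s.trim();
    }).filter(function (s) {
        return s.length > 0;
    }).map(function (s) {
        var colon = s.indexOf(":");
        if (colon < 0) return { name: s, arg: null };
        return { name: s.slice(0, colon).trim(), arg: s.slice(colon + 1).trim() };
    });
}

// Split a msgbase/filebase argument such as "Main/-gitlog/-commits,DOVE-Net"
// into names plus lowercased exclusion tokens (assumed to bind to that name).
function parseContainers(arg) {
    if (!arg) return [];
    return arg.split(",").map(function (item) {
        var parts = item.trim().split("/-");
        return {
            name: parts[0].trim(),
            exclude: parts.slice(1).map(function (t) {
                return t.trim().toLowerCase();
            })
        };
    });
}

// Usage with the two example settings from this page.
var demoSources = parseSources(
    "msgbase:Local,DOVE-Net; filebase; dokuwiki:/var/www/html/wiki/data/pages");
var demoGroups = parseContainers("Main/-gitlog/-commits,DOVE-Net");
</code>

Exclusions are parsed only from the ''msgbase'' and ''filebase'' arguments here; the ''dokuwiki'' argument is treated as an opaque path.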

===== Writing a crawler =====

Create ''exec/llm_index/<name>.js'' that defines a ''crawl(opts)'' function
returning an array of chunk objects:

<code javascript>
function crawl(opts) {
    // opts.arg        -- the text after the ':' in the source spec (or null)
    // opts.max_chunks -- soft cap the builder may pass
    var chunks = [];
    chunks.push({
        id:         "unique-id",        // stable id for this chunk
        text:       "the body text to index and retrieve",
        provenance: "From <where>",     // citation string shown to the model
        title:      "short title",      // optional; boosts title-term matches
        ts:         Math.floor(Date.now() / 1000)  // optional; epoch seconds (placeholder: now), enables recency weighting
    });
    return chunks;
}
</code>

Name the source in ''index_sources'' and it's picked up automatically — the
chat engine needs no changes.  Keep crawlers read-only and local.
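
As a fuller illustration of the ''crawl(opts)'' contract, here is a sketch of a hypothetical ''rules'' crawler (it would live at ''exec/llm_index/rules.js'').  The hard-coded ''RULES'' array stands in for data a real crawler would read from a local file, and two behaviors are assumptions rather than documented requirements: treating ''opts.arg'' as a section filter, and stopping outright once ''opts.max_chunks'' is reached.

<code javascript>
// Hypothetical crawler: indexes a small set of board rules.
// RULES is made-up sample data standing in for a local file read.
var RULES = [
    { key: "handles",  section: "accounts", body: "Handles must be unique and may not impersonate staff." },
    { key: "inactive", section: "accounts", body: "Accounts unused for a year may be purged." },
    { key: "uploads",  section: "files",    body: "Uploads must include a description." }
];

function crawl(opts) {
    opts = opts || {};
    // Assumption: the source argument selects one section (case-insensitive).
    var wanted = opts.arg ? String(opts.arg).toLowerCase() : null;
    // Assumption: honor the builder's soft cap by stopping once it is reached.
    var cap = opts.max_chunks > 0 ? opts.max_chunks : Infinity;
    var chunks = [];
    for (var i = 0; i < RULES.length && chunks.length < cap; i++) {
        var rule = RULES[i];
        if (wanted && rule.section !== wanted) continue;
        chunks.push({
            id:         "rules:" + rule.key,   // stable across rebuilds
            text:       rule.body,
            provenance: "From board rules, section " + rule.section,
            title:      rule.key
        });
    }
    return chunks;
}
</code>

With this file in place, a setting such as ''index_sources = rules:accounts'' would index only the two account rules.  The ''ts'' field is omitted because the sample data carries no dates.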

===== See Also =====

  * [[config:chat_llm.ini]] — ''index_*'' settings and source syntax
  * [[module:chat_llm]] — how retrieved chunks are used
  * [[module:llm_tools]] — live lookups (complementary to RAG)

{{tag>chat guru llm chat_llm rag bm25 index ai}}