| Title: | Open Knowledge Format (OKF) Ingestion |
|---|---|
| Description: | Read, validate, and load Open Knowledge Format (OKF) bundles (a directory of markdown files with YAML frontmatter) into a portable DuckDB catalog, build the concept graph, render to HTML, and optionally embed concept bodies for semantic search. Deterministic and agent-free: the same bundle always yields the same catalog, graph, and render, with no LLM calls in the core. Conformant and permissive per the OKF v0.1 specification. |
| Authors: | Travis Jakel [aut, cre] |
| Maintainer: | Travis Jakel <[email protected]> |
| License: | Apache License (>= 2) |
| Version: | 0.5.2 |
| Built: | 2026-06-30 13:25:44 UTC |
| Source: | https://github.com/cran/okf |
Concepts that link to a given concept ("linked from" / backlinks).
okf_backlinks(con, path)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| path | Bundle-relative concept path. |
Character vector of source concept paths (resolved inbound links).
Split a concept body into chunks on paragraph boundaries.
okf_chunk_body(body, target_chars = 600L)
| body | Concept body text. |
|---|---|
| target_chars | Approximate maximum chunk size in characters. |
Character vector of chunks.
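
As an illustration of chunking on paragraph boundaries, here is a minimal base-R sketch. It assumes paragraphs are separated by blank lines and that paragraphs are packed greedily until 'target_chars' would be exceeded; 'chunk_paragraphs' is a hypothetical helper, not the package's implementation.

```r
# Hypothetical sketch: greedy paragraph packing (assumed strategy).
chunk_paragraphs <- function(body, target_chars = 600L) {
  # Split on blank lines and drop empty paragraphs.
  paras <- trimws(strsplit(body, "\n[ \t]*\n+")[[1]])
  paras <- paras[nzchar(paras)]
  chunks <- character(0)
  current <- ""
  for (p in paras) {
    candidate <- if (nzchar(current)) paste(current, p, sep = "\n\n") else p
    if (nzchar(current) && nchar(candidate) > target_chars) {
      chunks <- c(chunks, current)  # close the chunk built so far
      current <- p                  # start a new chunk with this paragraph
    } else {
      current <- candidate
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

chunk_paragraphs("aaaa\n\nbbbb\n\ncccc", target_chars = 10L)
```

A single paragraph longer than 'target_chars' is kept whole in this sketch, which is one reason the limit is only approximate.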
Operates on the undirected resolved-link graph. Each node starts in its own community; nodes iteratively adopt the most common label among neighbours, ties broken by the lexicographically smallest label (so the result is fully reproducible – no randomness). Isolated nodes keep their own label.
okf_clusters(con, max_iter = 50L, include_reserved = FALSE)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| max_iter | Maximum propagation sweeps. |
| include_reserved | Include reserved concepts ('index.md'/'log.md') as nodes – useful for graph visualization, where 'index.md' is the hub. |
A data.frame with 'path' and integer 'cluster' (1-based, stable order).
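
The propagation rule described above can be sketched in base R. This is an illustrative version under stated assumptions: nodes are updated one at a time in sorted order (the source does not say whether updates are sequential or simultaneous), and the edge list is a data.frame with hypothetical columns 'src' and 'dst'.

```r
# Hypothetical sketch of deterministic label propagation (not the package code).
label_propagation <- function(edges, max_iter = 50L) {
  nodes <- sort(unique(c(edges$src, edges$dst)), method = "radix")
  # Undirected adjacency list keyed by node name.
  nbrs <- setNames(vector("list", length(nodes)), nodes)
  for (i in seq_len(nrow(edges))) {
    a <- edges$src[i]
    b <- edges$dst[i]
    if (a == b) next
    nbrs[[a]] <- union(nbrs[[a]], b)
    nbrs[[b]] <- union(nbrs[[b]], a)
  }
  labels <- setNames(nodes, nodes)  # every node starts in its own community
  for (iter in seq_len(max_iter)) {
    changed <- FALSE
    for (n in nodes) {
      if (length(nbrs[[n]]) == 0L) next  # isolated nodes keep their label
      counts <- table(labels[nbrs[[n]]])
      # Most common neighbour label; ties go to the smallest label.
      best <- sort(names(counts)[counts == max(counts)], method = "radix")[1]
      if (best != labels[[n]]) {
        labels[[n]] <- best
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  lv <- sort(unique(labels), method = "radix")
  data.frame(path = nodes, cluster = as.integer(factor(labels, levels = lv)))
}

edges <- data.frame(src = c("a", "b", "a", "x", "y", "x"),
                    dst = c("b", "c", "c", "y", "z", "z"))
label_propagation(edges)
```

On two disjoint triangles, as in the usage above, each triangle ends up in its own cluster.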
This is the OKF / "LLM wiki" consume primitive (Karpathy): hand the agent 'index.md' plus the relevant concept(s) and their link-neighborhood to read directly. It uses the concept graph – **no embeddings, no vector search**. With 'start', it walks the (undirected) link graph from that concept to 'depth'; without 'start', it packs all concepts. Output is capped to roughly 'max_tokens' (estimated at ~4 chars/token).
okf_context(con, start = NULL, depth = 1L, max_tokens = 8000L, include_index = TRUE)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| start | Optional concept path to center the neighborhood on. |
| depth | Link-graph radius around 'start' (ignored when 'start' is NULL). |
| max_tokens | Approximate output budget. |
| include_index | Prepend 'index.md' (the map) when present. |
A list with 'text' (the markdown blob), 'included'/'omitted' concept paths, and 'est_tokens'.
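
The two ideas in the description, a bounded walk over the undirected link graph and a budget at roughly 4 characters per token, can be sketched as follows. Both helpers ('neighborhood', 'pack_to_budget') and the 'src'/'dst' column names are hypothetical; the order in which the real function packs concepts is not stated in the source.

```r
# Hypothetical sketch: breadth-first neighbourhood to a fixed depth.
neighborhood <- function(edges, start, depth = 1L) {
  seen <- start
  frontier <- start
  for (d in seq_len(depth)) {
    # Follow links in both directions (the graph is treated as undirected).
    nxt <- unique(c(edges$dst[edges$src %in% frontier],
                    edges$src[edges$dst %in% frontier]))
    frontier <- setdiff(nxt, seen)
    if (length(frontier) == 0L) break
    seen <- c(seen, frontier)
  }
  seen
}

# Hypothetical sketch: keep concepts, in the given order, until the
# estimated token budget is used up (~4 characters per token).
pack_to_budget <- function(paths, bodies, max_tokens = 8000L) {
  est <- ceiling(nchar(bodies[paths]) / 4)
  keep <- cumsum(est) <= max_tokens
  list(included = paths[keep], omitted = paths[!keep],
       est_tokens = sum(est[keep]))
}

edges <- data.frame(src = c("a", "b", "c"), dst = c("b", "c", "d"))
paths <- neighborhood(edges, "b", depth = 1L)
bodies <- c(a = strrep("x", 40), b = strrep("x", 40), c = strrep("x", 40))
pack_to_budget(sort(paths), bodies, max_tokens = 25L)
```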
Combines the validation findings already stored in the catalog (missing type, broken links, orphans, non-ISO timestamps, ...) with maintenance checks (duplicate titles and, when 'now' is supplied, future/stale timestamps), and computes a health 'score': the percentage of non-reserved concepts with zero findings. Fully deterministic.
okf_doctor(con, now = NULL, stale_days = NULL)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| now | Optional ISO-8601 "current time" enabling stale/future-timestamp checks (kept explicit so the function stays deterministic; the CLI passes the wall clock). |
| stale_days | Optional integer; with 'now', flag timestamps older than this many days. |
A list with 'score', 'n_concepts', 'n_healthy', 'n_error', 'n_warn', 'by_rule' (named counts), and 'issues' (a data.frame of path/severity/rule/message).
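
The score arithmetic can be illustrated with a short base-R sketch. It assumes reserved concepts are identified by file name ('index.md', 'log.md') and that 'n_concepts' counts only non-reserved concepts; 'health_score' is a hypothetical helper, not the package's code.

```r
# Hypothetical sketch: percentage of non-reserved concepts with no findings.
health_score <- function(concept_paths, issues,
                         reserved = c("index.md", "log.md")) {
  concepts <- concept_paths[!basename(concept_paths) %in% reserved]
  flagged <- intersect(concepts, unique(issues$path))
  n_healthy <- length(concepts) - length(flagged)
  score <- if (length(concepts) > 0L) 100 * n_healthy / length(concepts) else NA_real_
  list(score = score, n_concepts = length(concepts), n_healthy = n_healthy)
}

paths <- c("index.md", "a.md", "b.md", "c.md", "d.md")
issues <- data.frame(path = c("a.md", "a.md"),
                     rule = c("broken_link", "non_iso_timestamp"))
health_score(paths, issues)$score
```

In this example one of four non-reserved concepts has findings, so the score is 75. The rule names are made up for the example.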
Two mechanical, deterministic repairs (never invents content):
- **timestamps** – a parseable non-ISO 'timestamp:' is rewritten to ISO-8601.
- **moved links** – a broken link whose basename matches *exactly one* concept is re-pointed to that concept (relative to the linking file).
Edits files in place. Anything ambiguous is left for [okf_doctor()] to report.
okf_doctor_fix(root)
| root | A bundle directory path. |
|---|---|
A data.frame of changes ('path', 'kind', 'before', 'after'); zero rows if nothing was safely fixable.
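
The "exactly one basename match" rule for moved links can be sketched as below. 'repoint_by_basename' is a hypothetical helper; it returns the bundle-relative path of the unique match and leaves out the final step of rewriting that path relative to the linking file.

```r
# Hypothetical sketch: re-point a broken link only when the match is unambiguous.
repoint_by_basename <- function(broken_target, known) {
  hits <- known[basename(known) == basename(broken_target)]
  if (length(hits) == 1L) hits else NA_character_  # ambiguous or missing: leave it
}

known <- c("a/x.md", "b/y.md", "c/y.md")
repoint_by_basename("old/x.md", known)  # unique match
repoint_by_basename("old/y.md", known)  # two candidates, so no repair
```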
Populates 'okf_chunk' with one row per chunk plus its embedding vector and the concept's 'content_hash'. By default replaces all chunks. With 'incremental = TRUE', only concepts whose 'content_hash' differs from what was last embedded are re-embedded (and removed concepts' chunks are dropped) – the expensive embedder calls are skipped for unchanged concepts.
okf_embed(con, embedder = NULL, target_chars = 600L, incremental = FALSE)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| embedder | An embedder function; defaults to [okf_ollama_embedder()]. |
| target_chars | Approximate chunk size in characters. |
| incremental | Re-embed only concepts whose content changed. |
The number of chunks (re)written this call (invisibly usable as an integer).
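
The incremental decision comes down to comparing hashes per concept. A minimal sketch, assuming both the current and the previously embedded state are available as named character vectors mapping path to 'content_hash' ('plan_incremental' and this representation are hypothetical):

```r
# Hypothetical sketch: decide which concepts to re-embed, drop, or skip.
plan_incremental <- function(current, embedded) {
  common <- intersect(names(current), names(embedded))
  changed <- common[current[common] != embedded[common]]
  added <- setdiff(names(current), names(embedded))
  list(
    reembed = sort(c(added, changed), method = "radix"),  # new or changed content
    drop    = setdiff(names(embedded), names(current)),   # concept was removed
    skip    = setdiff(common, changed)                    # hash unchanged
  )
}

current  <- c("a.md" = "h1", "b.md" = "h2x", "d.md" = "h4")
embedded <- c("a.md" = "h1", "b.md" = "h2",  "c.md" = "h3")
plan_incremental(current, embedded)
```

Only the concepts in 'reembed' would need embedder calls, which is where the savings come from.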
Extract markdown link targets from a concept body (OKF cross-links, sec. 4).
okf_extract_links(body)
| body | Concept body text. |
|---|---|
Character vector of raw link targets (as written).
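
For illustration, inline markdown link targets can be pulled out with a regular expression. This sketch only handles the inline form '[text](target)', also matches image links, and ignores reference-style links and code spans; whether the package handles those cases is not stated in the source. 'extract_md_links' is a hypothetical name.

```r
# Hypothetical sketch: extract targets of inline links of the form [text](target).
extract_md_links <- function(body) {
  pattern <- "\\[[^]]*\\]\\(([^)[:space:]]+)[^)]*\\)"
  hits <- regmatches(body, gregexpr(pattern, body, perl = TRUE))[[1]]
  # Keep only the target, dropping the link text and any optional title.
  sub("^\\[[^]]*\\]\\(([^)[:space:]]+).*$", "\\1", hits, perl = TRUE)
}

extract_md_links("See [A](a.md) and [B](../b/c.md \"title\"). No link here.")
```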
Local directories are used in place. Git URLs (github/gitlab/bitbucket, '.git', or 'git@') are shallow-cloned. Tar/zip archives (local path or 'http(s)' URL) are downloaded if remote and extracted. The caller MUST invoke the returned 'cleanup()' when done to remove any temporary files.
okf_fetch(source, subdir = NULL, branch = NULL)
| source | A directory path, git URL, or tar/zip path/URL. |
|---|---|
| subdir | Optional bundle path within the cloned/extracted tree. |
| branch | Optional git branch or tag (git sources only). |
A list with 'dir' (the resolved bundle directory), 'source_kind' ('"dir"'/'"git"'/'"tar"'/'"zip"'), and 'cleanup' (a function).
A force-directed graph drawn on a '<canvas>' with hand-rolled vanilla JS (no CDN, no framework) – pan, zoom, drag, type-to-search, nodes coloured by community ([okf_clusters()]). Clicking a node navigates to its rendered '.html' (relative), so dropping 'graph.html' into an [okf_html()] site root makes the graph a live map of the site. Fully offline; embeds the node/edge model as JSON.
okf_graph_html(con, out, site_title = NULL)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| out | Output '.html' file path. |
| site_title | Optional page title; defaults to the bundle directory name. |
The output path (invisibly).
Returns a JSON object with 'nodes' and 'edges'. Nodes carry 'id' (path), 'type', 'title', 'tags', 'cluster' (from [okf_clusters()]), and 'href' (the rendered '.html' path). Edges are resolved links with 'source' and 'target' fields. Feeds any external graph visualizer – the same "core is a contract" idea as the DuckDB catalog.
okf_graph_json(con, pretty = TRUE)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| pretty | Pretty-print the JSON. |
A JSON string (invisibly also suitable for writing to a file).
A text diagram for embedding directly in markdown (READMEs, docs, GitHub renders it natively) – the lightweight complement to the interactive [okf_graph_html()]. Node ids are sanitized; labels are the concept titles.
okf_graph_mermaid(con)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
A Mermaid diagram as a single string (a '```mermaid' fenced block).
Two modes. As a navigable **site** ('single = FALSE', the default), writes one self-contained '.html' per concept under 'out/' (mirroring the bundle's directory tree) plus an 'index.html' landing page; internal '.md' links are rewritten to '.html'. As a **single file** ('single = TRUE'), writes one self-contained '.html' at path 'out', with each concept an anchored '<section>' and intra-bundle links rewritten to in-page anchors. No JavaScript; CSS is inlined so output is portable. Reserved concepts ('index.md', 'log.md') are rendered too. Bodies are rendered with the commonmark package; broken/orphan links are surfaced in a per-page footer badge from the validation findings.
okf_html(con, out, single = FALSE, site_title = NULL)
| con | An open DuckDB connection to an okf catalog (from [okf_ingest()]). |
|---|---|
| out | Output directory (site mode) or output '.html' file path (single). |
| single | Emit one self-contained file instead of a per-concept site. |
| site_title | Optional title for the landing page / single-file header; defaults to the bundle directory name. |
A list with 'files' (paths written), 'n_concepts', and 'mode' (invisibly).
Reports direct 'outbound' (concepts it links to), direct 'inbound' (concepts linking to it, i.e. backlinks), and 'transitive' – every concept that can reach it by following resolved links (what a change here could ripple to).
okf_impact(con, path)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| path | Bundle-relative concept path. |
A list with 'path', 'outbound', 'inbound', 'transitive' (all sorted character vectors).
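
The 'transitive' set is a reverse reachability query: start at the concept and repeatedly follow links backwards. A minimal sketch, assuming the resolved links are available as a data.frame with hypothetical columns 'src' and 'dst':

```r
# Hypothetical sketch: every concept that can reach `path` via resolved links.
transitive_inbound <- function(edges, path) {
  seen <- character(0)
  frontier <- path
  while (length(frontier) > 0L) {
    srcs <- unique(edges$src[edges$dst %in% frontier])  # one step backwards
    frontier <- setdiff(srcs, c(seen, path))            # only newly found nodes
    seen <- c(seen, frontier)
  }
  sort(seen, method = "radix")
}

edges <- data.frame(src = c("a", "b", "d", "c"), dst = c("b", "c", "c", "e"))
transitive_inbound(edges, "c")
```

Here 'a', 'b', and 'd' can all reach 'c', while 'e' cannot, so a change to 'c' could ripple to the first three. Tracking visited nodes keeps the loop finite on cyclic graphs.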
Reads, validates, and loads the bundle into the 'okf_bundle', 'okf_concept', 'okf_link', and 'okf_validation' tables of a (file or in-memory) DuckDB database.
okf_ingest(root, db_path = ":memory:", ingested_at = NULL, bundle_id = NULL, source_kind = "dir", subdir = NULL, branch = NULL, incremental = FALSE)
| root | A bundle directory path, a git URL, a tar/zip path or URL, or a bundle list from [okf_read()]. Non-directory sources are fetched via [okf_fetch()] and cleaned up afterwards. |
|---|---|
| db_path | DuckDB path; defaults to in-memory '":memory:"'. |
| ingested_at | Optional ISO-8601 timestamp; defaults to the current time. |
| bundle_id | Optional stable bundle id. |
| source_kind | How the bundle was obtained (e.g. '"dir"'); auto-set for fetched sources. |
| subdir | Optional bundle path within a cloned/extracted source. |
| branch | Optional git branch or tag (git sources only). |
| incremental | Only rewrite concepts whose 'content_hash' changed since a prior ingest of the same bundle into 'db_path' (added/removed handled too); links and validation are always recomputed (they are graph-global). Falls back to a full load if the bundle is not already present. The 'summary' then includes 'changed'/'added'/'removed'/'cached' counts. |
A list with the open 'con', the 'bundle_id', and a 'summary' (counts, conformance, link totals). The caller owns/closes 'con'.
Build the concept graph (resolved and broken links) for a bundle.
okf_links(rd)
| rd | A bundle as returned by [okf_read()]. |
|---|---|
A data.frame with 'src_path', 'dst_raw', 'dst_path', 'resolved'.
An embedder is a function of 'texts' returning a list of numeric vectors. Swap in any such function (e.g. an OpenAI client) for [okf_embed()] / [okf_rag()].
okf_ollama_embedder(model = "nomic-embed-text", url = Sys.getenv("OLLAMA_URL", "http://localhost:11434"))
| model | Ollama embedding model name. |
|---|---|
| url | Ollama base URL (defaults to the 'OLLAMA_URL' env var or localhost). |
A function 'texts -> list(numeric)'. Requires the httr2 package.
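
To show the shape of the embedder contract, here is a toy embedder that needs no network access. It maps each text to a 26-element vector of letter counts, which is not semantically meaningful and is only useful for tests and demos; 'letter_count_embedder' is a hypothetical name.

```r
# Hypothetical toy embedder: texts -> list of numeric vectors (letter counts a-z).
letter_count_embedder <- function() {
  function(texts) {
    lapply(texts, function(txt) {
      chars <- strsplit(tolower(txt), "")[[1]]
      as.numeric(table(factor(chars, levels = letters)))
    })
  }
}

embedder <- letter_count_embedder()
vecs <- embedder(c("abba", "xyz"))
length(vecs)       # one vector per input text
length(vecs[[1]])  # 26 dimensions
```

Any function with this input and output shape should fit the contract described above, assuming every returned vector has the same length.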
Parse the YAML frontmatter and body of a single OKF concept file.
okf_parse_file(path)
| path | Path to a markdown file. |
|---|---|
A list with 'meta' (parsed frontmatter, or 'NULL'), 'body', and 'err' ('NA' on success, else '"no_frontmatter"', '"unclosed_frontmatter"', or '"yaml_parse_error"').
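
The first two error codes correspond to how the '---' delimiters are found. The sketch below splits already-read lines into a YAML block and a body; it does not parse the YAML (a real parser such as the yaml package would be needed for that, and for 'yaml_parse_error'). 'split_frontmatter' and the body returned on error are assumptions.

```r
# Hypothetical sketch: split a concept file's lines into frontmatter and body.
split_frontmatter <- function(lines) {
  if (length(lines) == 0L || trimws(lines[1]) != "---") {
    return(list(yaml = NULL, body = paste(lines, collapse = "\n"),
                err = "no_frontmatter"))
  }
  closers <- which(trimws(lines[-1]) == "---")
  if (length(closers) == 0L) {
    return(list(yaml = NULL, body = "", err = "unclosed_frontmatter"))
  }
  end <- closers[1] + 1L  # index of the closing delimiter in `lines`
  list(yaml = lines[seq_len(end - 2L) + 1L],           # lines between delimiters
       body = paste(lines[-seq_len(end)], collapse = "\n"),
       err = NA_character_)
}

split_frontmatter(c("---", "type: note", "title: X", "---", "Body text"))
```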
Query helpers over an ingested OKF catalog.
okf_concepts(con)
okf_graph_df(con)
okf_findings(con)
okf_search(con, term)
| con | An open DuckDB connection to an okf catalog. |
|---|---|
| term | Search term for [okf_search()] (matched against concept bodies). |
A data.frame: concepts ([okf_concepts]), link edges ([okf_graph_df]), validation findings ([okf_findings]), or body matches ([okf_search]).
Embeds 'query' and returns the top-k most cosine-similar chunks (via DuckDB's native 'list_cosine_similarity'). Run [okf_embed()] first.
okf_rag(con, query, embedder = NULL, k = 5L)
| con | An open DuckDB connection to an embedded okf catalog. |
|---|---|
| query | Query string. |
| embedder | An embedder function; defaults to [okf_ollama_embedder()]. |
| k | Number of results to return. |
A data.frame with 'path', 'title', 'chunk_id', 'score', 'text'.
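
The ranking step is ordinary cosine similarity followed by a top-k cut. The sketch below does the same computation in plain R rather than in DuckDB, purely to show the arithmetic; 'cosine' and 'top_k' are hypothetical helpers.

```r
# Hypothetical sketch: rank chunk vectors by cosine similarity to a query vector.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

top_k <- function(query_vec, chunk_vecs, k = 5L) {
  scores <- vapply(chunk_vecs, cosine, numeric(1), b = query_vec)
  ord <- order(scores, decreasing = TRUE)
  head(data.frame(chunk_id = seq_along(chunk_vecs)[ord], score = scores[ord]), k)
}

chunk_vecs <- list(c(1, 0), c(0, 1), c(1, 1))
top_k(c(1, 0), chunk_vecs, k = 2L)
```

With the query pointing along the first axis, chunk 1 scores 1 and chunk 3 scores about 0.707, so those two are returned in that order.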
Read an OKF bundle from a directory into an in-memory representation.
okf_read(root, bundle_id = NULL, source_kind = "dir")
| root | Path to the bundle directory. |
|---|---|
| bundle_id | Optional stable id; defaults to a hash of the root path. |
| source_kind | How the bundle was obtained (e.g. '"dir"'). |
A list with 'bundle_id', 'root', 'okf_version', 'source_kind', 'concepts' (parsed per-file records), and 'known' (all concept paths).
Resolve a markdown link target to a bundle-relative concept path.
okf_resolve_link(raw, src_rel, known)
| raw | Raw link target. |
|---|---|
| src_rel | Bundle-relative path of the linking concept. |
| known | Character vector of all known concept paths in the bundle. |
The resolved bundle-relative path, or 'NA' if it does not resolve.
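
One plausible way to resolve a target is to join it onto the linking concept's directory, normalize '.' and '..' segments, and check the result against 'known'. The sketch below does that under several assumptions not confirmed by the source: fragments and query strings are dropped, URLs with a scheme never resolve, a leading '/' means bundle-root-relative, and paths that climb above the bundle root do not resolve. 'resolve_link_sketch' is a hypothetical name.

```r
# Hypothetical sketch of relative link resolution (assumed rules, see lead-in).
resolve_link_sketch <- function(raw, src_rel, known) {
  target <- sub("[#?].*$", "", raw)  # drop fragment / query string
  is_url <- grepl("^[A-Za-z][A-Za-z0-9+.-]*:", target)
  if (!nzchar(target) || is_url) return(NA_character_)
  base <- if (startsWith(target, "/")) {
    character(0)
  } else {
    head(strsplit(src_rel, "/", fixed = TRUE)[[1]], -1)  # directory of the source
  }
  parts <- c(base, strsplit(target, "/", fixed = TRUE)[[1]])
  out <- character(0)
  for (p in parts) {
    if (p == "" || p == ".") next
    if (p == "..") {
      if (length(out) == 0L) return(NA_character_)  # escapes the bundle root
      out <- head(out, -1)
    } else {
      out <- c(out, p)
    }
  }
  path <- paste(out, collapse = "/")
  if (path %in% known) path else NA_character_
}

known <- c("a.md", "sub/b.md", "sub/deep/c.md")
resolve_link_sketch("../a.md", "sub/b.md", known)
resolve_link_sketch("deep/c.md#section", "sub/b.md", known)
```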
Hard rules (severity 'error'): parseable frontmatter, non-empty 'type'. Soft findings (severity 'warn'): missing recommended fields, non-ISO timestamps, broken links. Never rejects the bundle – returns findings.
okf_validate(rd)
| rd | A bundle as returned by [okf_read()]. |
|---|---|
A data.frame with 'path', 'severity', 'rule', 'message'.