
Template mining

ZahirScan compresses repetitive structure into templates and attaches metadata blocks per file type. Use the sidebar under Template mining to jump between topics.

Not every format produces templates. Many files return templates: [] and instead ship metadata (including column stats for tables). UBLX maps these fields to the Templates, Writing, and Metadata panes.

| Topic | Page |
| --- | --- |
| Templates, compression, categories | This page |
| Writing footprint | Prose/style metrics (not templates) |
| Column statistics | Tabular & SQLite column profiles |

Templates

A template is one recurring pattern with placeholders. Templates are stored in the top-level templates array (and mirrored inside the phase-2 MiningResult when present).

Template object

```json
{
  "pattern": "[DATE] [TIME] ERROR: Process [POS:3] failed",
  "count": 842,
  "examples": {
    "POS:3": ["1234", "5678", "9012"]
  }
}
```

| Field | Meaning |
| --- | --- |
| pattern | Structure with bracketed placeholders, e.g. [POS:n], [HEADER:2], [LIST:12]:type=unordered, [WORD:0] |
| count | Lines, JSON records, or structural elements matching this pattern |
| examples | Sample values per placeholder key (BTreeMap, keys sorted; count capped by config) |

Output sorts templates by count descending (most common first).
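
To make the template object concrete, here is a minimal Python sketch that reads the example above, substitutes the first sample value into each placeholder that has examples, and orders a list of templates by count. The helper names `fill_placeholders` and `most_common_first` are hypothetical and not part of ZahirScan; the sketch assumes only the three fields documented above.

```python
import json

# Template object copied from the example above.
raw = """
{
  "pattern": "[DATE] [TIME] ERROR: Process [POS:3] failed",
  "count": 842,
  "examples": {"POS:3": ["1234", "5678", "9012"]}
}
"""


def fill_placeholders(template: dict) -> str:
    """Substitute the first sample value for each placeholder that has examples."""
    text = template["pattern"]
    for key, samples in template.get("examples", {}).items():
        if samples:
            text = text.replace(f"[{key}]", samples[0])
    return text


def most_common_first(templates: list) -> list:
    """Order templates by count, highest first (the order the output uses)."""
    return sorted(templates, key=lambda t: t["count"], reverse=True)


template = json.loads(raw)
print(fill_placeholders(template))
```

Placeholders without an entry in examples (here [DATE] and [TIME]) are left untouched.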

Placeholder families (by category)

| Category | Typical placeholders |
| --- | --- |
| Logs | [POS:n], typed date/time tokens, literals for fixed columns |
| Plain text / HTML / EPUB body | [WORD:n], [PREFIX], [SUFFIX], shape groups like length + ending punctuation |
| Markdown | [HEADER:level], [LIST:n]:type=…, [CODE_BLOCK:…]:lang=…, [PARAGRAPH:n]:quotes=… |
| JSON | Key paths with dynamic value slots; stable keys stay literal in the pattern string |
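
A consumer may want to pull placeholders out of a pattern string. The sketch below uses a regular expression whose grammar is an assumption inferred only from the examples on this page ([NAME] or [NAME:arg], optionally followed by :key=value); the actual placeholder syntax may be broader, and `parse_placeholders` is a hypothetical helper.

```python
import re

# Assumed grammar, inferred from the examples on this page only:
#   [NAME] or [NAME:arg], optionally followed by :key=value
PLACEHOLDER = re.compile(r"\[([A-Z_]+)(?::([^\]]*))?\](?::(\w+)=(\S+))?")


def parse_placeholders(pattern: str) -> list:
    """Return one dict per placeholder found in a pattern string."""
    found = []
    for match in PLACEHOLDER.finditer(pattern):
        name, arg, attr_key, attr_value = match.groups()
        found.append({
            "name": name,
            "arg": arg,  # None when the placeholder has no ":arg" part
            "attrs": {attr_key: attr_value} if attr_key else {},
        })
    return found


print(parse_placeholders("[LIST:8]:type=unordered"))
print(parse_placeholders("[DATE] [TIME] ERROR: Process [POS:3] failed"))
```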

Mining result & compression

Phase 2 builds a MiningResult (embedded in Zahir output alongside top-level fields):

| Field | Description |
| --- | --- |
| templates | Mined patterns; [] if format skips mining or nothing repeats |
| original_tokens | Estimated source tokens |
| compressed_tokens | Estimated tokens after template representation |
| token_reduction_percent | Percent reduction vs original |
| writing_footprint | Optional; see Writing footprint |

Top-level Output always includes templates. Full mode (-f) also adds file-level stats and compression:

```json
{
  "compression": {
    "original_tokens": 120000,
    "compressed_tokens": 18000,
    "reduction_percent": 85.0
  }
}
```

Typical reduction on template-eligible content is 80–95%. Column-heavy files gain more from column statistics than from templates.
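
The reduction figure follows from the two token counts. A minimal sketch of the arithmetic, assuming the percentage is computed against original_tokens and rounded to one decimal place (the rounding and the zero-input handling are assumptions, and `reduction_percent` here is a hypothetical helper):

```python
def reduction_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Percent of tokens saved relative to the original estimate."""
    if original_tokens <= 0:
        return 0.0  # assumption: treat empty input as no reduction
    saved = original_tokens - compressed_tokens
    return round(100.0 * saved / original_tokens, 1)


# Values from the compression example above.
print(reduction_percent(120000, 18000))  # 85.0
```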

Results per category

Categories that mine templates

| Category | Formats | Mining approach | Writing footprint |
| --- | --- | --- | --- |
| Logs | Plain, JSON logs, structured logs | Per-line token positions; static tokens literal, variable slots → placeholders | No |
| Plain text | .txt, generic text | Sentence n-grams / phrases; shape fallback if no exact match | Yes |
| Markdown | .md | Document structure (headers, lists, code blocks) + sentence shapes in paragraphs | Yes |
| HTML | .html, .htm | Visible body text → same as plain text | Yes |
| JSON | .json | Per-value patterns from key/value frequency across records | No |
| EPUB | .epub | Spine body text → plain-text pipeline (skipped if DRM/encrypted) | Yes when body parsed |
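
The log approach in the table (per-line token positions, static tokens kept literal, variable slots turned into placeholders) can be illustrated with a toy miner. This is a simplified sketch, not ZahirScan's algorithm: it groups lines by token count only, uses 0-based [POS:i] indices, and caps samples at three per slot, all of which are assumptions made for illustration.

```python
from collections import defaultdict


def mine_line_templates(lines: list) -> list:
    """Toy miner: literal where a position never varies, [POS:i] where it does."""
    # Group lines by token count so positions line up within a group.
    groups = defaultdict(list)
    for line in lines:
        tokens = line.split()
        if tokens:
            groups[len(tokens)].append(tokens)

    templates = []
    for rows in groups.values():
        parts = []
        examples = {}
        for i, column in enumerate(zip(*rows)):
            distinct = sorted(set(column))
            if len(distinct) == 1:
                parts.append(distinct[0])      # static token stays literal
            else:
                key = f"POS:{i}"               # variable slot (0-based here)
                parts.append(f"[{key}]")
                examples[key] = distinct[:3]   # cap the samples kept
        templates.append({
            "pattern": " ".join(parts),
            "count": len(rows),
            "examples": examples,
        })
    # Most common pattern first, matching the documented output order.
    return sorted(templates, key=lambda t: t["count"], reverse=True)


sample = ["ERROR worker 17 failed", "ERROR worker 42 failed", "INFO started"]
for template in mine_line_templates(sample):
    print(template)
```

A real miner would also need typed tokens such as dates and times, which this sketch leaves out.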

Example patterns

```text
# Log line
2025-01-15T10:00:01Z ERROR service [POS:4] connection reset

# Markdown structure
[HEADER:2]
[LIST:8]:type=unordered
[PARAGRAPH:32]:quotes=true

# JSON record (illustrative)
{"level":"INFO","service":"api","msg":"[WORD:2]"}

# Shape fallback (plain text / HTML / EPUB)
[SHAPE:14_words_period]
```

Categories with metadata only (templates: [])

| Category | Formats | Primary JSON blocks |
| --- | --- | --- |
| Delimited text | CSV, TSV, tab, psv | csv_metadata + column stats |
| Columnar binary | Parquet, Arrow, Avro, ORC | parquet_metadata, arrow_ipc_metadata, … + columns |
| Scientific arrays | NumPy, HDF5, NetCDF, Zarr, .tet, MATLAB, MTX | Format-specific *_metadata (shapes, dtypes, catalogs) |
| Documents | DOCX, XLSX, PPTX, PDF, EPUB (metadata path) | docx_metadata, pdf_metadata, … |
| Settings | INI, TOML, YAML, XML | Recursive schema stats, no line templates |
| SQLite | .db, … | sqlite_metadata with per-table column info |
| Models | ONNX, GGUF, TFLite, Safetensors | Graph/tensor summaries |
| Archives / media / code | ZIP, TAR, images, video, audio, source files | Entry lists, probes, code_metadata |

See Metadata extraction for field-by-field descriptions.
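
Because many formats return templates: [], a consumer should not treat an empty array as a failure. The sketch below checks for templates and lists whichever metadata blocks are present. It assumes the *_metadata blocks are top-level keys of the output object, which this page implies but does not state; `summarize_output` and the sample dictionary are hypothetical.

```python
def summarize_output(output: dict) -> dict:
    """Report whether templates were mined and which *_metadata blocks exist."""
    templates = output.get("templates", [])
    # Assumption: metadata blocks are top-level keys ending in "_metadata".
    metadata_blocks = sorted(key for key in output if key.endswith("_metadata"))
    return {
        "has_templates": bool(templates),
        "metadata_blocks": metadata_blocks,
    }


# Hypothetical output for a CSV file: no templates, one metadata block.
csv_output = {"templates": [], "csv_metadata": {"columns": []}}
print(summarize_output(csv_output))
```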

Output modes

| Mode | CLI | Templates | Writing footprint | Column / format metadata |
| --- | --- | --- | --- | --- |
| Templates-only (default) | (none) | Yes | Text-like only | When the format pipeline provides it |
| Full | -f | Yes | | All *_metadata blocks + compression, timings, byte/line counts |

Use full for debugging; use templates-only for production, UBLX enhance, and sharing smaller JSON.

Privacy

-r / --redact replaces filesystem paths with ***/filename.ext in output JSON.
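
The redacted form can be pictured with a short sketch that keeps only the file name and replaces the directory part with ***. This illustrates the output shape described above and is not ZahirScan's implementation; `redact_path` is a hypothetical helper, and the handling of Windows-style separators is an assumption.

```python
from pathlib import PurePosixPath


def redact_path(path: str) -> str:
    """Replace the directory portion of a path with ***, keeping the file name."""
    # Assumption: normalise backslashes so Windows-style paths are handled too.
    name = PurePosixPath(path.replace("\\", "/")).name
    return f"***/{name}"


print(redact_path("/home/alice/logs/app.log"))  # ***/app.log
```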

Related: Writing footprint, Column statistics, Metadata extraction, CLI.
