Skip to content

Column statistics

Column statistics are not templates. They describe tabular and database columns from sampled rows and appear in format-specific metadata blocks. Delimited files and columnar binaries use a compact shared shape; SQLite uses a richer per-column struct.

Compact columns array (ColumnStat)

Used by CSV/TSV/tab/psv and columnar formats (Parquet, Arrow IPC, Avro, ORC) via flattened ColumnarCommonFields:

json
{
  "row_count": 50000,
  "column_count": 12,
  "stats_rows_sampled": 2000,
  "columns": [
    {
      "i": 0,
      "name": "user_id",
      "t": "number",
      "pt": "Int64",
      "null_pct": 0.0,
      "num": {
        "min": 1,
        "max": 98421,
        "mean": 44102.5,
        "median": 43800,
        "range": 98420,
        "iqr": 22000,
        "stdev": 28104.2
      }
    },
    {
      "i": 3,
      "name": "status",
      "t": "string",
      "null_pct": 1.2,
      "uniq": 4
    },
    {
      "i": 7,
      "name": "created_at",
      "t": "date",
      "date": {
        "span_days": 412.5,
        "span_minutes": 594000,
        "min": "2024-01-02T00:00:00Z",
        "max": "2025-02-18T23:59:59Z"
      }
    },
    {
      "i": 9,
      "name": "active",
      "t": "boolean",
      "bool": { "true_pct": 73.4 }
    }
  ]
}
ColumnStat fieldMeaning
iColumn index
nameColumn name when known
tInferred logical type: number, string, date, boolean, …
ptPhysical / schema dtype when available (e.g. Arrow Int32, Parquet type)
null_pctNull rate in the sample (omitted if not computed)
uniqDistinct non-null string values (string columns only)
numNumeric stats
boolBoolean stats (true_pct in JSON)
dateDate stats

Table-level fields

FieldMeaning
row_countTotal rows when known
column_countNumber of columns
stats_rows_sampledRows used for stats (sample may be smaller than row_count)
encodingText encoding when relevant

CSV-specific fields on csv_metadata: delimiter, quote_character, escape_character, has_header.

Sampling scales with file size, table width, and bytes-per-row when row count is known early — wide tables cap how many columns or rows are fully profiled.

Typed stat objects

Numeric stats

min, max, mean, median, range, iqr, stdev — used for numeric CSV/columnar columns and SQLite INTEGER/REAL.

Date stats

span_days, span_minutes, min, max (ISO-style strings) — parsed date columns in CSV or TEXT SQLite columns.

Boolean stats

true_pct (alias true_percentage) — share of true-like values (true, yes, 1, y, …).

Text stats (SQLite)

min_length, max_length, avg_length — for TEXT columns.

Blob stats (SQLite)

min_size, max_size, avg_size (bytes) — for BLOB columns.

SQLite columns

sqlite_metadata adds schema-oriented data: tables, primary/foreign keys, indexes, row counts. Each column is a ColumnInfo object (verbose names, not the compact i/t form):

FieldMeaning
name, type_nameColumn name and declared SQLite type
not_null, default_valueConstraints
is_primary_key, is_foreign_keyKey roles
null_percentage, unique_countSampled null % and distinct count
numeric_stats, date_stats, boolean_stats, text_stats, blob_statsTyped stats (boolean vs numeric are mutually exclusive)

Column stats without columns

Some formats expose shapes and dtypes instead of a columns array — NumPy (.npy/.npz), HDF5, NetCDF, Zarr, Matrix Market, MATLAB, Tetration .tet, model files. See Metadata extraction.

UBLX display

UBLX renders compact columns JSON as typed tables in Metadata / Writing (when column stats are present). Config key typed_column_tables: none | abbrev (default, caps wide tables at 20 rows) | full.

Related: Template mining overview, Metadata extraction.

UBLX · Nefaxer · ZahirScan