Introduction

Canon helps you understand and take control of digital assets spread across many drives, backups, and years.

The Problem

Over time, files accumulate across devices: old hard drives, backup folders, cloud downloads, phone exports. Finding what you have, identifying duplicates, and organizing everything into a coherent archive becomes overwhelming.

The Approach

Canon takes a methodical, incremental approach:

  1. Scan your devices to index files and compute content hashes
  2. Enrich with metadata extracted by external tools (EXIF, file types, etc.)
  3. Discover what you have using filters and queries
  4. Archive selected files to a canonical location, at your own pace

Each step is revisitable. You can scan new sources, add more metadata, refine your queries, and archive in small batches. Canon tracks what’s already archived, so you always know your progress.

Key Features

  • Content-based deduplication: Files are identified by their hash, not location
  • Flexible metadata: Import any key-value facts from external tools
  • Powerful filtering: Query by any combination of facts using boolean expressions
  • Safe archiving: Preview operations, validate integrity, and maintain audit trails
  • Incremental workflow: Work at your own pace with full state persistence

Ready to get started? See Setup and Getting Started.

Setup

Installation

Install Canon from crates.io:

cargo install canon-archive

This installs the canon binary.

From Source

Alternatively, build from source:

git clone https://github.com/robklg/canon.git
cd canon
cargo install --path .

Database

Canon stores all state in a SQLite database. The default location is ~/.canon/canon.db.

You can override this with the --db flag:

canon --db /path/to/custom.db scan ...

The database is created automatically on first use. It contains:

  • Registered roots and their scan state
  • All indexed sources with metadata
  • Content hashes and object references
  • Imported facts from enrichment

Verify Installation

canon --help

You should see the list of available commands. You’re ready to start scanning your files.

Getting Started

This guide walks through a typical Canon workflow: scanning files, enriching with metadata, querying, and archiving.

Scanning

First, index your source files and existing archive:

# Add source roots (files you want to organize)
canon scan --add --role source /path/to/photos
canon scan --add --role source /path/to/backup-drive/photos
canon scan --add --role source --comment "Old backup, possibly duplicates" /Volumes/OldDrive

# Add an archive root (your organized destination)
canon scan --add --role archive /Volumes/Archive

By default, Canon computes content hashes during scanning. This enables deduplication and archive tracking.

Enriching

Use external tools to extract metadata. The example below uses exiftool to extract EXIF data including GPS-based geolocation:

canon worklist --where 'source.ext|lowercase IN (jpg, jpeg, heic, mov, mp4)' \
  | ./scripts/exif-worklist.sh \
  | canon import-facts

See Enriching for details on the worklist/import pipeline.

Querying

Discover what facts are available and explore your files:

# See all available facts
canon facts

# Check value distribution for a specific fact
canon facts --key content.geo.region          # Where were photos taken?
canon facts --key "content.DateTimeOriginal|year"  # Which years?

# List files matching filters
canon ls --where 'content.geo.city=Bletchley'

# Preview files (macOS)
canon ls -0 --where 'content.geo.city=Bletchley' | xargs -0 open -a Preview

Archiving

When you find a collection worth archiving, create a manifest:

canon cluster generate \
  --where 'content.DateTimeOriginal|year=2023' \
  --where 'content.geo.region="North Holland"' \
  --dest /Volumes/Archive/Trips/2023-Amsterdam

This creates manifest.toml with the query parameters and a manifest.lock with matching sources.

Edit manifest.toml to customize the output pattern:

[output]
pattern = "{content.DateTimeOriginal|date}/{filename}"
base_dir = "/Volumes/Archive/Trips/2023-Amsterdam"

Preview and apply:

canon apply manifest.toml --dry-run   # Preview what will happen
canon apply manifest.toml             # Execute the copy

Files are copied to the archive with paths like:

/Volumes/Archive/Trips/2023-Amsterdam/2023-06-16/IMG_001.jpg

Next Steps

  • Learn about Concepts to understand how Canon models your files
  • Explore the full Commands reference
  • See Filters for advanced query syntax

Concepts

Understanding these core concepts will help you use Canon effectively.

  • Roots: Storage locations that Canon tracks
  • Source: A file discovered on disk
  • Object: Unique content identified by hash
  • Source vs. Object: How files relate to content
  • Facts: Metadata attached to sources or objects

Source

A source is a file discovered on disk during scanning. Canon tracks:

  • Location: Root path + relative path within the root
  • Identity: Device ID and inode for move detection
  • Metadata: Size and modification time
  • Integrity: Partial hash (first + last 8KB) for validation during transfers
  • State: A basis_rev counter that increments when size or mtime changes

Sources represent where files are found. Multiple sources can point to the same content (see Object) when files are duplicated across locations.

When a source is scanned with hashing enabled (the default), Canon computes its SHA-256 hash and links it to an object. This enables deduplication and archive tracking.
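To make the difference between the partial hash and the full content hash concrete, here is a minimal sketch of a "first + last 8KB" digest. It is an illustration under stated assumptions, not Canon's implementation: the choice of SHA-256 for the partial hash, the way head and tail are combined, and the names partial_hash and write_file are all hypothetical.

```python
import hashlib
import os
import tempfile

CHUNK = 8 * 1024  # 8 KB, matching the "first + last 8KB" description


def partial_hash(path: str) -> str:
    """Hash only the head and tail of a file (assumed scheme, not Canon's code)."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        digest.update(f.read(CHUNK))
        if size > CHUNK:
            # Start the tail after the head so small files are not hashed twice.
            f.seek(max(CHUNK, size - CHUNK))
            digest.update(f.read(CHUNK))
    return digest.hexdigest()


def write_file(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    data = bytearray(b"x" * 100_000)
    write_file(path, data)
    before = partial_hash(path)

    data[50_000] = ord("y")  # change a byte in the middle of the file
    write_file(path, data)
    print(partial_hash(path) == before)  # True: the middle is not sampled

    data[-1] = ord("y")  # change the last byte
    write_file(path, data)
    print(partial_hash(path) == before)  # False: the tail is sampled
```

The demo shows why a partial hash is only a cheap sanity check: a change in the middle of a large file goes unnoticed, which is why deduplication relies on the full SHA-256 hash instead.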

Exclusion

Sources can be marked as excluded to skip them during archiving. A source is considered excluded if:

  • The source itself is marked excluded, OR
  • The source’s linked object is marked excluded

This two-level check means that excluding an object effectively excludes all sources with that content. Object-level exclusion is useful when you want to skip content regardless of where it appears.
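The two-level check can be sketched in a few lines. The Source and Obj classes and the is_excluded function below are hypothetical names used only to model the rule described above; they do not reflect Canon's database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Obj:
    sha256: str
    excluded: bool = False


@dataclass
class Source:
    path: str
    excluded: bool = False
    obj: Optional[Obj] = None  # None while the source has not been hashed


def is_excluded(source: Source) -> bool:
    """A source is excluded if it, or the object it links to, is marked excluded."""
    return source.excluded or (source.obj is not None and source.obj.excluded)


thumbnail = Obj(sha256="abc123", excluded=True)  # excluded at the object level
a = Source("/backup1/thumb.jpg", obj=thumbnail)
b = Source("/backup2/thumb.jpg", obj=thumbnail)
c = Source("/backup1/notes.txt", excluded=True)  # excluded at the source level
d = Source("/backup1/photo.jpg")

print([is_excluded(s) for s in (a, b, c, d)])  # [True, True, True, False]
```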

Object

An object represents unique content identified by its SHA-256 hash. Objects are content-addressed: two files with identical bytes will have the same hash and thus reference the same object.

Objects enable:

  • Deduplication: Multiple sources can point to the same object
  • Archive tracking: When content exists in an archive, all sources with that hash are marked as archived
  • Fact sharing: Metadata attached to an object is available on all sources with that content

Objects are created automatically when sources are hashed during scanning or enrichment.

Source vs. Object

Understanding the relationship between sources and objects is key to how Canon handles deduplication and archive tracking.

Sources Are Locations

When a root is scanned, Canon indexes every file it finds as a source. Each source represents a specific file at a specific path.

Objects Are Content

When sources are hashed, Canon creates or links them to objects. An object represents the underlying content, independent of where it was found.

Source A: /backup1/photos/IMG_001.jpg  ─┐
Source B: /backup2/old/IMG_001.jpg     ─┼─► Object (hash: abc123...)
Source C: /downloads/photo.jpg         ─┘

All three sources above have identical content, so they reference the same object.
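The same idea expressed as code: a rough sketch that groups paths by the SHA-256 of their bytes. The function name group_by_content and the in-memory file contents are made up for illustration; Canon performs this linking in its database during scanning.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_by_content(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each content hash (the object) to every path (source) that has it."""
    groups = defaultdict(list)
    for path, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(path)
    return dict(groups)


files = {
    "/backup1/photos/IMG_001.jpg": b"same image bytes",
    "/backup2/old/IMG_001.jpg": b"same image bytes",
    "/downloads/photo.jpg": b"same image bytes",
    "/downloads/other.jpg": b"different bytes",
}

for sha, paths in group_by_content(files).items():
    print(sha[:12], len(paths), "source(s)")
```

File names play no role in the grouping: /downloads/photo.jpg lands in the same group as the two IMG_001.jpg copies because only the bytes are hashed.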

Fact Sharing

When a source is linked to an object:

  • Content facts (like EXIF metadata) can be stored on the object and become available to all sources with that hash
  • Source facts (like file path) remain specific to each source

This allows metadata to flow between different copies of the same content. Import a fact once, and it’s available everywhere that content exists.
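One way to picture fact sharing is as a merge of two dictionaries: facts stored on the object plus facts stored on the individual source. The helper name effective_facts and the sample values are assumptions made for this sketch.

```python
from typing import Dict, Optional


def effective_facts(source_facts: Dict[str, object],
                    object_facts: Optional[Dict[str, object]]) -> Dict[str, object]:
    """Facts visible on a source: its own facts plus those of its linked object."""
    merged = dict(object_facts or {})  # unhashed sources have no object facts
    merged.update(source_facts)        # namespaces differ, so keys do not collide
    return merged


object_facts = {"content.Make": "Apple", "content.geo.city": "Bletchley"}
copy_a = {"source.ext": "jpg", "source.size": 1024}
copy_b = {"source.ext": "JPG", "source.size": 1024}

print(effective_facts(copy_a, object_facts)["content.geo.city"])  # Bletchley
print(effective_facts(copy_b, object_facts)["content.geo.city"])  # Bletchley
print(effective_facts(copy_a, None))  # only the source.* facts
```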

Archive Tracking

Canon uses the source-object relationship to track archiving progress:

  • When you archive a file, Canon copies it to an archive root and records the object’s hash
  • Any source with that same hash is now considered “archived”
  • The coverage command shows how many of your sources exist in an archive
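The rule above boils down to a set membership test on hashes. The sketch below classifies sources into the three states that the ls flags --archived, --unarchived, and --unhashed distinguish; the function name archive_status and the sample data are hypothetical.

```python
from typing import Dict, Optional, Set


def archive_status(source_hashes: Dict[str, Optional[str]],
                   archive_hashes: Set[str]) -> Dict[str, str]:
    """Classify each source path by whether its content already exists in an archive."""
    status = {}
    for path, sha in source_hashes.items():
        if sha is None:
            status[path] = "unhashed"    # cannot be tracked without a hash
        elif sha in archive_hashes:
            status[path] = "archived"    # same content exists in an archive root
        else:
            status[path] = "unarchived"
    return status


sources = {
    "/backup1/IMG_001.jpg": "abc123",
    "/backup2/IMG_001.jpg": "abc123",  # duplicate of the file above
    "/backup1/IMG_002.jpg": "def456",
    "/backup1/video.mov": None,        # scanned with --no-hash
}
archive = {"abc123"}

for path, state in archive_status(sources, archive).items():
    print(f"{state:<11} {path}")
```

Archiving one copy of IMG_001.jpg marks both duplicates as archived, because the check looks at content, not location.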

Hashing

By default, Canon hashes all files during scanning. Since hashing can be time-consuming for large collections, you can:

  • Use --no-hash during scan to skip hashing initially
  • Hash selectively via the enrichment pipeline, targeting specific file types

Unhashed sources cannot be linked to objects, so they cannot be deduplicated or tracked for archive coverage.

Facts

Facts are key-value metadata attached to sources or objects.

Types of Facts

Built-in facts are collected automatically during scanning:

  • source.ext - File extension
  • source.size - File size in bytes
  • source.mtime - Modification timestamp
  • content.hash.sha256 - Content hash (when computed)

Imported facts come from external tools via the enrichment pipeline:

  • EXIF metadata: content.Make, content.Model, content.DateTimeOriginal
  • Geolocation: content.geo.city, content.geo.country
  • Media info: content.mime, content.duration
  • Any custom key-value pairs you choose to import

Namespaces

Facts are namespaced:

  • source.* - Facts about the file on disk (path, size, timestamps)
  • content.* - Facts about the content itself (stored on objects when hashed)

When querying, the content. prefix is optional: --where 'Make=Apple' is equivalent to --where 'content.Make=Apple'.

Value Types

Canon stores facts as:

  • Text: Strings like "Apple" or "image/jpeg"
  • Numbers: Integers or decimals like 1024 or 3.14
  • Timestamps: Unix timestamps, enabling date modifiers like |year and |month

Type hints can be provided during import to ensure correct parsing. See Enriching for details.

Roots

A root is a directory on a storage device that Canon tracks. Each root is identified by its absolute path and assigned a role.

Roles

Canon distinguishes two root roles:

Source roots contain assets you want to explore, reconcile, or archive. They may be unstructured, incomplete, or contain duplicates. Examples: old backup drives, phone exports, download folders.

Archive roots hold an intentional structure that you maintain. Files archived by Canon are placed here. Examples: your organized photo library, music collection, document archive.

Rules

  • Roots may not overlap (one root cannot be inside another)
  • A root can be any directory, not just a drive or mount point
  • You can have multiple roots of each type
  • Roots can be suspended to temporarily hide them from operations
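The no-overlap rule can be checked with plain path arithmetic. This is a sketch only: the function name overlaps is hypothetical, and it compares paths lexically, whereas a real check would probably also need to consider symlinks and path normalization.

```python
from pathlib import PurePosixPath


def overlaps(a: str, b: str) -> bool:
    """True if the two roots are the same directory or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


print(overlaps("/Volumes/Archive", "/Volumes/Archive/Photos"))        # True
print(overlaps("/Volumes/Archive/Photos", "/Volumes/Archive/Music"))  # False
print(overlaps("/data/photos", "/data/photos-old"))                   # False
```

The last case is why a simple string prefix test is not enough: /data/photos-old starts with /data/photos but is a sibling, not a subdirectory.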

Typical Setup

Source roots:
  /Volumes/OldBackup       (unorganized photos from 2015)
  /Volumes/PhoneExport     (recent phone backup)
  ~/Downloads/Photos       (miscellaneous downloads)

Archive roots:
  /Volumes/Archive/Photos  (canonical photo library)
  /Volumes/Archive/Music   (canonical music library)

Canon Commands

Common Options

Most commands that operate on sources share these options:

Path scope — Limit a command to a specific directory by passing a path:

canon ls /path/to/photos
canon facts /path/to/photos
canon coverage /path/to/photos

Filters — Select sources using --where with boolean expressions:

canon ls --where 'source.ext=jpg'
canon facts --where 'source.size > 1000000'
canon cluster generate --where 'geo.country=Netherlands' --dest /archive

Multiple --where flags are combined with AND. See Filters for the full syntax.

Command Reference

  • Managing Roots: Add and manage storage locations
    • scan: Scan existing or new roots
    • roots: List, suspend, or remove roots
  • Enriching: Import metadata from external tools
  • Querying: Explore your indexed files
    • ls: List sources matching filters
    • facts: Discover available metadata
    • compare: Compare directories by content
  • Managing Sources: Control which sources are processed
    • exclude: Mark sources to skip during archiving
  • Archiving: Organize files into your canonical archive
    • coverage: Check archive progress
    • cluster: Generate a manifest for archiving
    • apply: Execute the manifest to copy/move files

Managing Roots

To track files in Canon, you first add and scan roots. This makes their sources available for enrichment and archiving. You can suspend roots to temporarily hide them from Canon commands.

Adding new roots and re-scanning existing ones is done with the scan command.

Listing, suspending, and removing roots is done with canon roots.

Scan

Scan directories and index files.

When you scan a root, Canon walks the directory tree starting at the given path(s). For each file, it collects basic metadata such as size and last modification time and, by default, computes the content hash. After scanning, Canon knows about every source in that root. Files that were hashed are linked to objects.

Hashing can take a long time, so you can skip it with --no-hash. Skipping is useful when you intend to hash selectively, for instance because you are only interested in certain file types.

There is no practical limit on how many roots you can add. It can help to scan collections of files that belong together as separate roots. Each root can carry a comment, which you can use to record what it contains or to keep notes on what you discovered there.

If you have an already organized location that you want Canon to treat as your canonical archive, scan it with --role archive from the start. The role is set when the root is added; to change it, you must remove the root and re-add it with the new role. You can add multiple archive roots, for instance one for your music collection and another for your eBooks.

When to run scan

If your filesystem changes regularly, re-scan your roots so that Canon can detect the changes and you do not miss files when archiving. Note that, when archiving, Canon always checks the validity of the files to be archived.

Another use case is periodic integrity verification of your archives. Use --verify to recompute hashes for all files and detect corruption. Canon exits with a non-zero status if any mismatches are found, making it suitable for cron jobs that alert on failure.

Examples

# Add a new root and scan it (--add and --role required for new roots)
canon scan --add --role source /path/to/photos

# Scan multiple new roots
canon scan --add --role source /path/to/photos /path/to/more/photos

# Add with a descriptive comment
canon scan --add --role source --comment "Photos from 2020 trip" /path/to/photos

# Add as an archive root (for tracking already-organized files)
canon scan --add --role archive /path/to/archive

# Re-scan an existing root (--role optional, validated against existing)
canon scan /path/to/photos

# Scan just a subtree within an existing root
canon scan /path/to/photos/2024

# Scan without computing hashes (just index files)
canon scan --no-hash /path/to/photos

# Verify archive integrity by recomputing all hashes (good for cron jobs)
canon scan --verify /Volumes/Archive

Hash computation: By default, Canon computes content hashes for new and changed files during scan. This enables deduplication and archive tracking. Use --no-hash to skip hashing if you just want to index files quickly.

Integrity verification: Use --verify to recompute hashes for all files, even unchanged ones. Run periodically (e.g., via cron) to detect file corruption. If a file’s hash changes without its mtime changing, Canon warns about possible corruption and exits with an error.
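The corruption rule can be written down as a small decision function. Only the "hash changed but mtime did not" case comes from the description above; the names classify, "ok", "modified", and "possible corruption" are labels invented for this sketch.

```python
def classify(old_hash: str, old_mtime: int, new_hash: str, new_mtime: int) -> str:
    """Compare a stored (hash, mtime) pair with freshly computed values."""
    if new_hash == old_hash:
        return "ok"
    if new_mtime == old_mtime:
        # Content changed although the file was never reported as modified.
        return "possible corruption"
    return "modified"  # assumed label for an ordinary edit


print(classify("abc", 1703980800, "abc", 1703980800))  # ok
print(classify("abc", 1703980800, "xyz", 1703980800))  # possible corruption
print(classify("abc", 1703980800, "xyz", 1704067200))  # modified
```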

Discovering untracked directories: Use --candidates to find directories with files that aren’t yet under any root. This is useful when exploring a drive or backup to see what could be added:

# Find candidate roots to add under a path
canon scan --candidates /Volumes/Backup

# Output shows directories with untracked files
Candidate roots to add:
  /Volumes/Backup/photos  (3 directories with files)
  /Volumes/Backup/imports  (1 directory with files)

Directories under existing roots are skipped. When multiple subdirectories share a common ancestor that could be added as a single root, they’re rolled up (unless that ancestor contains an existing root).

Output shows what was found:

Scanned 1234 files: 100 new, 5 updated, 2 moved, 1127 unchanged, 0 missing
Hashed 105 files

canon roots

List and manage registered roots.

Roots are added via scan and managed with the roots command. You can list, suspend/unsuspend, add comments, or remove roots.

Important notes:

  • Removing a root also removes its sources and attached facts from the database
  • Removing a root does not delete any files on disk
  • If you re-add a removed root, you’ll need to re-enrich it

# List all roots with file counts and last scan time
canon roots

# List roots at or beneath a specific path
canon roots /path/to/photos

# List only suspended roots
canon roots --suspended

# Set a comment on a root (omit text to clear)
canon roots comment id:1 "Old backup, possibly duplicates"
canon roots comment id:1

# Suspend a root (hides from all operations without deleting data)
canon roots suspend id:1
canon roots suspend path:/path/to/photos

# Unsuspend a root (make visible again)
canon roots unsuspend id:1

# Remove a root by ID (files on disk are NOT deleted)
canon roots rm id:1

# Remove a root by path
canon roots rm path:/path/to/photos

# Skip confirmation prompt
canon roots rm id:1 --yes

Example output:

ID   ROLE       FILES  LAST SCAN         PATH
1    source     16635  2h ago            /path/to/photos
2    archive   169941  5d ago            /path/to/archive
3    source      1234  never             /path/to/backup (Old backup, possibly duplicates)

Suspending Roots

Suspended roots are hidden from listings, excluded from scan --all, and their sources are excluded from all queries (ls, facts, coverage, worklist, etc.). Suspended roots still prevent overlapping (you cannot add a new root at a suspended root’s path). Use --suspended to list only suspended roots.

Removing Roots

When removing a root, Canon shows how many sources are “in archive” (same content exists in an archive) vs “not in archive”, and suggests using canon ls <path> to preview which sources will be forgotten.

Root Specs

Several commands accept root specifications in two formats:

Format     Example               Description
id:N       id:1                  By database ID (shown in canon roots output)
path:/...  path:/path/to/photos  By exact path

canon roots suspend id:1
canon roots suspend path:/path/to/photos

Enriching

Add metadata to indexed files using external processors.

Canon uses a pipeline model: worklist outputs sources as JSONL, an external processor extracts metadata, then import-facts stores the results.

canon worklist → processor → canon import-facts

A processor can be any CLI tool or script that extracts information from files: exiftool for EXIF data, file for MIME types, ffprobe for media info, or custom scripts you write yourself.

Basic Usage

Extract EXIF metadata from images:

canon worklist --where 'source.ext|lowercase IN (jpg, jpeg, heic)' \
  | ./scripts/exif-worklist.sh \
  | canon import-facts

Note the --where filter: it’s usually smart to limit the worklist to files the processor can actually handle.

Detect MIME types for all files:

canon worklist | canonargs --fact mime -- file -b --mime-type {} | canon import-facts

After enrichment, the imported facts become available for filtering and querying.

Provided Processors

Canon includes ready-to-use processors:

Processor                                        Purpose                        Requires
scripts/exif-worklist.sh                         EXIF, GPS, and media metadata  exiftool, jq
scripts/hash-worklist.sh                         SHA-256 content hashes         jq
canonargs --fact mime -- file -b --mime-type {}  MIME type detection            canonargs

Install canonargs with: cargo install canonargs

Going Deeper

Tip: Selective Hashing

Content hashing normally happens during scan. If you prefer to hash only specific file types, use --no-hash during scan and hash selectively via the pipeline:

canon scan --no-hash --add --role source /path/to/mixed-files
canon worklist --where 'mime~"image/*" OR mime~"video/*"' \
  | ./scripts/hash-worklist.sh \
  | canon import-facts

canon worklist

Output sources as JSONL for processing by external tools.

# All sources (from source roots only)
canon worklist

# Only sources missing a content hash
canon worklist --where 'NOT content.hash.sha256?'

# Only JPG files
canon worklist --where 'source.ext=jpg'

# Scope to a specific directory
canon worklist /path/to/photos

# Include sources from archive roots (for backfilling facts)
canon worklist --include-archived

# Include existing facts in output (for chained enrichment)
canon worklist --emit content.geo.lat --emit content.geo.lon

Output Format

Each line is a JSON object with source metadata:

{"source_id":123,"path":"/full/path/to/file.jpg","root_id":1,"size":1024,"mtime":1703980800,"basis_rev":0}

Field      Description
source_id  Database ID (pass through to import-facts)
path       Full absolute path to the file
root_id    ID of the root containing this source
size       File size in bytes
mtime      Modification time (Unix timestamp)
basis_rev  Revision counter for staleness detection

Emitting Existing Facts

With --emit, requested facts are included in the output (null if absent):

canon worklist --emit geo.lat --emit geo.lon
{"source_id":123,"path":"/...","basis_rev":0,"facts":{"geo.lat":52.37,"geo.lon":4.89}}
{"source_id":124,"path":"/...","basis_rev":0,"facts":{"geo.lat":null,"geo.lon":null}}

This enables processors to build on previous enrichment:

  • Dependent enrichment: Use extracted coordinates to look up location names
  • Fact combination: Merge data from multiple sources into derived facts

Example: reverse geocoding files that have coordinates but no city name:

canon worklist --emit geo.lat --emit geo.lon --where 'geo.lat? AND NOT geo.city?' \
  | ./scripts/reverse-geocode.sh \
  | canon import-facts

Staleness Detection

The worklist is a snapshot of sources at a point in time. Each entry includes basis_rev which tracks file changes. Processors should pass this through to import-facts, which will skip the import if the file changed since the worklist was generated.

The size and mtime fields allow processors to verify a file hasn’t changed before extracting facts.
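A processor written in Python could perform that check with os.stat before doing any expensive work. This is a sketch under assumptions: the helper name still_matches is hypothetical, and it assumes mtime in the worklist is a whole number of seconds, as in the example entry above.

```python
import os
import tempfile


def still_matches(entry: dict) -> bool:
    """True if the file on disk still has the size and mtime recorded in the worklist."""
    try:
        st = os.stat(entry["path"])
    except OSError:
        return False  # file vanished or is unreadable: treat as changed
    return st.st_size == entry["size"] and int(st.st_mtime) == entry["mtime"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "photo.jpg")
    with open(path, "wb") as f:
        f.write(b"original bytes")
    st = os.stat(path)
    entry = {"source_id": 1, "path": path, "basis_rev": 0,
             "size": st.st_size, "mtime": int(st.st_mtime)}

    print(still_matches(entry))  # True

    with open(path, "ab") as f:
        f.write(b" plus more")
    print(still_matches(entry))  # False: the size no longer matches
```

This check is only an early exit for the processor. The basis_rev comparison in import-facts remains the authoritative guard, so basis_rev must still be passed through.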

canon import-facts

Import facts from JSONL on stdin. Designed to receive output from a processor that consumed a worklist.

canon worklist | some-processor | canon import-facts

# Allow importing facts for sources in archive roots
canon worklist --include-archived | some-processor | canon import-facts --allow-archived

Input Format

Each line must be a JSON object with source_id, basis_rev, and facts:

{"source_id":123,"basis_rev":0,"facts":{"hash.sha256":"abc123...","mime":"image/jpeg"}}

Field      Description
source_id  Source ID from the worklist (required)
basis_rev  Revision from the worklist for staleness check (required)
facts      Object mapping fact keys to values

The processor must pass through source_id and basis_rev from the worklist entry. If basis_rev doesn’t match the source’s current value, the import is skipped (the file changed since the worklist was generated).

Fact Namespacing

Facts are automatically namespaced under content.*. For example, mime becomes content.mime.

The special key hash.sha256 creates or links an object, enabling deduplication and archive tracking.

Type Hints

Types matter. Canon stores facts as text, numbers, or timestamps. The type determines what operations work on a fact:

  • Timestamps enable date modifiers (|year, |month, |date) and date comparisons (>=2024-01-01)
  • Numbers enable numeric comparisons (>1000, <=5.0) and the |bucket modifier
  • Text enables string matching (=, ~ glob) and string modifiers (|lowercase, |stem)

If a datetime like "2024:07:23 11:06:32" is stored as text instead of a timestamp, queries like --where 'DateTimeOriginal|year=2024' won’t work—the modifier expects a timestamp, not a string.

Providing Type Hints

Wrap values in an object with value and type:

{"source_id":123,"basis_rev":0,"facts":{
  "DateTimeOriginal": {"value": "2024:07:23 11:06:32", "type": "datetime"},
  "duration": {"value": "1:23:45", "type": "duration"},
  "rating": 5
}}

Type      Parses                                      Stored As
datetime  ISO dates, EXIF format, plain years (2024)  Unix timestamp
duration  "1:23:45", "5:30", or seconds as number     Seconds (number)
(none)    Strings as text, numbers as numbers         As-is
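To illustrate what the datetime and duration hints ask for, here is a rough sketch of the conversions in the table. It is not Canon's parser: the function names, the exact list of accepted formats, and the choice to interpret naive dates as UTC are all assumptions made for this example.

```python
from datetime import datetime, timezone

# Formats tried in order: EXIF, ISO date-time, ISO date, plain year.
DATETIME_FORMATS = ("%Y:%m:%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%Y")


def parse_datetime(value: str) -> int:
    """Convert a date string to a Unix timestamp (interpreted as UTC here)."""
    for fmt in DATETIME_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"unrecognized datetime: {value!r}")


def parse_duration(value) -> float:
    """Convert "H:MM:SS", "M:SS", or a plain number to seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


print(parse_datetime("2024"))     # 1704067200 (2024-01-01 00:00:00 UTC)
print(parse_duration("1:23:45"))  # 5025.0
print(parse_duration("5:30"))     # 330.0
```

Once a value is stored as a timestamp, extracting the year is simple arithmetic on a number, which is why |year works on typed dates but not on the original string.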

Common Pitfalls

Dates as strings: EXIF dates from tools like exiftool come as strings ("2024:07:23 11:06:32"). Without a type hint, they’re stored as text and time modifiers won’t work. Always use "type": "datetime" for date fields.

Mixed types: A fact key must have a consistent type across all sources. You cannot store DateTimeOriginal as text for some files and as a timestamp for others. If you initially imported facts with the wrong type and need to re-import with the correct type, first delete the existing entries:

# Delete all DateTimeOriginal facts that were stored as text
canon facts delete --key content.DateTimeOriginal --type text

Then re-run your processor with proper type hints.

Archive Sources

By default, importing facts for sources in archive roots is skipped. Use --allow-archived to enable this (useful for backfilling metadata on already-archived files).

Writing Processors

Processors are scripts or programs that read worklist entries, extract metadata from files, and output facts for import.

Input and Output

A processor reads JSONL from worklist and writes JSONL for import-facts.

Input (from worklist):

{"source_id":123,"path":"/photos/IMG_001.jpg","basis_rev":0,"size":1024,"mtime":1703980800}

Output (for import-facts):

{"source_id":123,"basis_rev":0,"facts":{"Make":"Apple","Model":"iPhone 12"}}

The processor must pass through source_id and basis_rev unchanged.

Custom Processors

Read JSONL from stdin, extract facts from each file, output JSONL to stdout:

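#!/bin/bash
# Read one worklist entry (a JSON object) per line from stdin.
while IFS= read -r line; do
  # Fields that must be passed through unchanged, plus the file path
  source_id=$(printf '%s\n' "$line" | jq -r '.source_id')
  basis_rev=$(printf '%s\n' "$line" | jq -r '.basis_rev')
  path=$(printf '%s\n' "$line" | jq -r '.path')

  # Extract facts (example: EXIF data)
  facts=$(exiftool -json -Make -Model "$path" 2>/dev/null | jq -c '.[0]')

  # Skip files that exiftool could not process
  if [ -z "$facts" ] || [ "$facts" = "null" ]; then
    continue
  fi

  # Emit one compact JSON object per line for import-facts
  jq -nc \
    --argjson source_id "$source_id" \
    --argjson basis_rev "$basis_rev" \
    --argjson facts "$facts" \
    '{source_id: $source_id, basis_rev: $basis_rev, facts: $facts}'
done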
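The same contract can be met in any language. Below is a minimal Python sketch of a processor; to stay self-contained it guesses the MIME type from the file extension with the standard library's mimetypes module, which is only a stand-in for a real extractor such as file or exiftool. The function names process_entry and process are hypothetical.

```python
import json
import mimetypes
from typing import Iterable, Iterator, Optional


def process_entry(entry: dict) -> Optional[dict]:
    """Build one import-facts record, or None if nothing could be extracted."""
    mime, _ = mimetypes.guess_type(entry["path"])  # stand-in extractor
    if mime is None:
        return None  # skip files we cannot handle
    return {
        "source_id": entry["source_id"],  # passed through unchanged
        "basis_rev": entry["basis_rev"],  # passed through unchanged
        "facts": {"mime": mime},
    }


def process(lines: Iterable[str]) -> Iterator[str]:
    """Turn worklist JSONL lines into import-facts JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = process_entry(json.loads(line))
        if record is not None:
            yield json.dumps(record, separators=(",", ":"))


worklist = [
    '{"source_id":123,"path":"/photos/IMG_001.jpg","basis_rev":0,"size":1024,"mtime":1703980800}',
    '{"source_id":124,"path":"/photos/README","basis_rev":2,"size":10,"mtime":1703980800}',
]
for out in process(worklist):
    print(out)
```

To use it in a pipeline, pass sys.stdin to process() instead of the sample list and print each result, then place the script between canon worklist and canon import-facts.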

The canonargs Helper

If you don’t want to handle JSONL parsing and output formatting yourself, canonargs takes care of that. You only provide a command that extracts data from a single file.

Installation

cargo install canonargs

Single Fact Mode

When your command outputs a single value:

canon worklist | canonargs --fact mime -- file -b --mime-type {} | canon import-facts

The {} is replaced with the file path. The command’s stdout becomes the fact value.

Default behavior: Values are stored as text. To specify a type, add --type:

# Store as datetime (enables |year, |month modifiers)
canon worklist | canonargs --fact DateTimeOriginal --type datetime -- exiftool -DateTimeOriginal -s3 {} | canon import-facts

# Store image width as number (using ImageMagick's identify)
canon worklist | canonargs --fact width --type number -- identify -format '%w' {} | canon import-facts

Valid types: datetime, duration, number

Key-Value Mode

When your command outputs key=value pairs (one per line):

canon worklist | canonargs --kv -- my-extractor {} | canon import-facts

Default behavior: All values are stored as text. To specify types, use key:type=value syntax:

width:number=1920
height:number=1080
DateTimeOriginal:datetime=2024:07:23 14:30:00
codec=h264
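As an illustration of what an extractor for this mode might look like, the sketch below reads the width and height from a PNG header and formats them as typed key-value lines. The function name png_facts and the script name mentioned afterwards are hypothetical; the byte offsets follow the PNG file layout (8-byte signature, then the IHDR chunk).

```python
import struct
from typing import List

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_facts(header: bytes) -> List[str]:
    """Return key:type=value lines for a PNG header, or nothing for other data."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return []
    # IHDR data starts at byte 16: width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return [f"width:number={width}", f"height:number={height}"]


# Build a fake header in memory instead of reading a real file.
sample = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 1920, 1080)
for line in png_facts(sample):
    print(line)
# width:number=1920
# height:number=1080
```

Wrapped in a small script that reads the first 24 bytes of the path given as its argument and prints these lines, it could take the place of my-extractor in the command above, for example as python3 png-facts.py {}.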

JSON Mode

When your command outputs a JSON object:

canon worklist | canonargs --json -- exiftool -json {} | canon import-facts

Example extractor output:

{"Make": "Apple", "Model": "iPhone 12", "DateTimeOriginal": "2024:07:23 14:30:00"}

JSON mode auto-detects numbers. If your command outputs "width": 1920 (a JSON number), it’s stored as a number. If it outputs "width": "1920" (a quoted string), it’s stored as text.

For datetime fields, you still need to use the typed hint format:

{"DateTimeOriginal": {"value": "2024:07:23 14:30:00", "type": "datetime"}}
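If an extractor emits plain strings for dates, a small wrapper can rewrite selected keys into the typed form before the JSON is printed. This is a sketch; the function name add_type_hints and the hints mapping are assumptions, not part of canonargs.

```python
import json
from typing import Dict


def add_type_hints(facts: Dict[str, object], hints: Dict[str, str]) -> Dict[str, object]:
    """Wrap the values of hinted keys as {"value": ..., "type": ...}; leave others as-is."""
    return {
        key: {"value": value, "type": hints[key]} if key in hints else value
        for key, value in facts.items()
    }


raw = {"Make": "Apple", "Model": "iPhone 12", "DateTimeOriginal": "2024:07:23 14:30:00"}
typed = add_type_hints(raw, {"DateTimeOriginal": "datetime"})
print(json.dumps(typed))
```

Keys that are not listed in the hints mapping pass through untouched, so JSON numbers keep being auto-detected as numbers.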

Chaining

Processors can be chained since canonargs passes through the worklist entry and merges facts:

canon worklist \
  | canonargs --fact mime -- file -b --mime-type {} \
  | canonargs --json -- exiftool -json {} \
  | canon import-facts

Using Existing Facts

Processors can access previously imported facts via the --emit flag on worklist. See Emitting Existing Facts for details.

Type Hints

Important: The type of a fact determines what operations work on it:

  • Timestamps enable |year, |month modifiers and date comparisons (>=2024-01-01)
  • Numbers enable numeric comparisons (>1000) and |bucket modifier
  • Text enables string matching and |lowercase, |stem modifiers

If your processor outputs dates as strings or numbers as strings, add type hints:

{"source_id":123,"basis_rev":0,"facts":{
  "DateTimeOriginal": {"value": "2024:07:23 11:06:32", "type": "datetime"},
  "duration": {"value": "1:23:45", "type": "duration"},
  "width": 1920
}}

Without "type": "datetime", a date string like "2024:07:23 11:06:32" is stored as text and --where 'DateTimeOriginal|year=2024' won’t work.

Numbers from JSON are automatically stored as numbers. But if your extractor outputs "width": "1920" (a string), numeric comparisons like --where 'width>1000' won’t work as expected.

See import-facts for full details.

Tips

  • Always pass through source_id and basis_rev unchanged
  • Use jq -c for compact JSON output (one object per line)
  • Handle errors gracefully—skip files that can’t be processed
  • Use type hints for datetime fields so modifiers work correctly
  • Ensure numbers are actual JSON numbers, not quoted strings

Querying

After scanning and enriching, you can explore your indexed files.

All query commands support path scoping (limit to a subdirectory) and --where filters.

canon ls

List sources matching filters. Useful for quick inspection and piping to other tools.

# List all sources in current directory
canon ls .

# List sources matching a filter
canon ls --where 'source.ext=jpg'

# Filter by source ID
canon ls --where 'source.id=12345'

# List only archived sources (content exists in an archive)
canon ls --archived

# List archived sources with their archive location(s)
# Output: source_path<TAB>archive_path (one line per archive location)
canon ls --archived=show

# List only unarchived sources (hashed but not in any archive)
canon ls --unarchived

# List only unhashed sources (no content hash yet)
canon ls --unhashed

# Show duplicate files (same content hash), grouped by hash
canon ls --duplicates

# Include sources from archive roots (automatic when scope is in an archive)
canon ls --include-archived

# Include excluded sources
canon ls --include-excluded

# Long format with size and date
canon ls -l

# Null-delimited output for xargs (handles spaces in paths, macOS)
canon ls -0 --where 'source.ext=jpg' | xargs -0 open -a Preview

Path display:

  • Relative path input (., subdir) → relative output paths
  • Absolute path input (/path/to/dir) → absolute output paths

Output is one path per line (stdout), with a count printed to stderr:

vacation/img001.jpg
vacation/img002.jpg
work/doc.pdf
3 sources

canon facts

Discover what metadata you have and check coverage.

# Overview of all facts (source roots only by default)
canon facts

# Scoped to a directory
canon facts /path/to/photos

# With filters
canon facts --where 'source.ext=jpg'

# Value distribution for a specific fact
canon facts --key content.Make

# With modifiers: group mtime by year-month
canon facts --key 'source.mtime|yearmonth'

# With accessors: distribution by top-level directory
canon facts --key 'source.rel_path[0]'

# Combine accessor and modifier: distribution by filename extension
canon facts --key 'source.rel_path[-1]|ext'

# Show hidden built-in facts
canon facts --all

# Unlimited results (default is 50)
canon facts --key content.hash.sha256 --limit 0

# Include sources from archive roots
canon facts --include-archived

# Group by root (see which roots contribute to each value)
canon facts --key source.ext --by-root

# Group by any fact key (with modifiers)
canon facts --key source.ext --group-by 'source.mtime|year'

# Compound grouping (root + another fact)
canon facts --key source.ext --by-root --group-by 'content.Make'

Example output:

Sources matching filters: 34692

Fact                               Count   Coverage
────────────────────────────────────────────────────
source.ext                         34692     100.0%  (built-in)
source.size                        34692     100.0%  (built-in)
source.mtime                       34692     100.0%  (built-in)
source.path                        34692     100.0%  (built-in)
content.hash.sha256                34692     100.0%
content.mime                       34692     100.0%
content.Model                       7935      22.9%
content.Make                        7935      22.9%
...

Example grouped output (--by-root):

source.ext (by root)

jpg (total: 12,500, 36.0%)
  id:1  ...stack/Backup/Pictures          8,000   64.0%
  id:2  ...castor-import/gringo           4,500   36.0%

png (total: 8,200, 23.6%)
  id:1  ...stack/Backup/Pictures          5,000   61.0%
  id:3  ...castor-import/hydra            3,200   39.0%

canon facts delete

Delete facts by key. Useful for removing incorrect or unwanted metadata.

# Preview deletion (dry-run by default)
canon facts delete content.mime --on object
canon facts delete content.Make --on source /path/to/photos --where 'source.ext=jpg'

# Execute deletion
canon facts delete content.mime --on object --yes

  • --on source or --on object is required to specify entity type
  • Protected namespaces (source.*) cannot be deleted
  • Dry-run by default; use --yes to execute

canon prune

Clean up orphaned or stale data from the database.

# Preview stale facts (file changed since fact was recorded)
canon prune --stale-facts

# Preview orphaned objects (no present sources reference them)
canon prune --orphaned-objects

# Preview facts for excluded sources/objects
canon prune --excluded-facts
canon prune --excluded-facts=source   # Only source facts
canon prune --excluded-facts=object   # Only object facts

# Execute deletion
canon prune --stale-facts --yes
canon prune --orphaned-objects --yes
canon prune --excluded-facts --yes

Stale facts are those where observed_basis_rev no longer matches the source’s current basis_rev (meaning the file was modified after the fact was imported).

Orphaned objects are content entries with no remaining present sources. This can happen when files are deleted. You may want to keep them as a historical record, or delete them to clean up the database.

Excluded facts are metadata for sources or objects you’ve marked as excluded. Since you’ve decided not to archive them, you may want to remove their facts to free up database space.

canon compare

Compare two folders by content hash. Useful for verifying backups or finding differences between directories.

# Compare two directories
canon compare /path/to/folder_a /path/to/folder_b

# With filters
canon compare /path/to/folder_a /path/to/folder_b --where 'source.ext=jpg'

# Show file paths for differences
canon compare /path/to/folder_a /path/to/folder_b --verbose

Output shows:

  • Files only in A (by content)
  • Files only in B (by content)
  • Files in both (matching content hash)

Exit code is 0 if identical, 1 if differences found.
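
The three output categories correspond to plain set operations on content hashes. The sketch below is illustrative only and is not Canon's implementation: the function name `compare_by_hash` and the path-to-hash dictionaries are hypothetical stand-ins for what the database would provide.

```python
def compare_by_hash(folder_a, folder_b):
    """Each argument maps a file path to its content hash (hypothetical input shape)."""
    hashes_a = set(folder_a.values())
    hashes_b = set(folder_b.values())
    return {
        "only_a": hashes_a - hashes_b,   # content found only in A
        "only_b": hashes_b - hashes_a,   # content found only in B
        "both": hashes_a & hashes_b,     # content present in both
    }


a = {"a/img1.jpg": "h1", "a/img2.jpg": "h2"}
b = {"b/renamed.jpg": "h1", "b/new.jpg": "h3"}
result = compare_by_hash(a, b)

# Mirrors the documented exit code: 0 if identical, 1 if differences found.
exit_code = 0 if not (result["only_a"] or result["only_b"]) else 1
print(result, exit_code)
```

Because the comparison is keyed on hashes rather than paths, the renamed file still counts as present in both folders.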

Managing Sources

After scanning and enriching, you may want to control which sources are included in archiving operations.

The exclude command lets you mark sources to skip during cluster generate and apply. This is useful for:

  • Ignoring temporary or system files
  • Skipping known duplicates while keeping a preferred copy
  • Filtering out small files below a size threshold
  • Removing unwanted files from consideration without deleting them

Exclusions are stored directly on sources and can be cleared at any time.

canon exclude

Manage source exclusions. Excluded sources are skipped by most commands.

# Mark sources as excluded (e.g., small files, temp files)
canon exclude set --where 'source.size<1000'
canon exclude set /path/to/photos --where 'source.ext=tmp'

# Exclude a specific file by path
canon exclude set /path/to/photos/unwanted.jpg

# Exclude by source ID (shown in ls --duplicates output)
canon exclude set --id 12345

# Preview what would be excluded
canon exclude set --where 'source.ext=bak' --dry-run

# List currently excluded sources
canon exclude list
canon exclude list /path/to/photos

# Remove exclusions
canon exclude clear
canon exclude clear --where 'source.ext=tmp'

# Preview what would be cleared
canon exclude clear --where 'source.ext=tmp' --dry-run

canon exclude duplicates

Automatically exclude duplicate files while keeping copies in a preferred location.

# Exclude duplicates, keeping files under /preferred/path
canon exclude duplicates /scope/path --prefer /preferred/path

# Preview what would be excluded
canon exclude duplicates /scope/path --prefer /preferred/path --dry-run

# With filters
canon exclude duplicates /scope/path --prefer /preferred/path --where 'source.ext=jpg'

This is useful for deduplicating across backup drives while keeping the “canonical” copy in your preferred location.
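
As a rough mental model, the selection can be sketched as grouping sources by content hash and excluding the copies outside the preferred path. This is a sketch under assumptions, not Canon's actual logic: `pick_exclusions` is a hypothetical name, and the choice to leave a duplicate group untouched when no copy lives under the preferred path is an assumption.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pick_exclusions(sources, prefer):
    """sources: list of (path, content_hash) pairs. Returns the paths to exclude."""
    by_hash = defaultdict(list)
    for path, digest in sources:
        by_hash[digest].append(path)

    prefer_root = PurePosixPath(prefer)
    excluded = []
    for paths in by_hash.values():
        kept = [p for p in paths if prefer_root in PurePosixPath(p).parents]
        # Assumption: only act when a preferred copy exists, so content is never lost.
        if len(paths) > 1 and kept:
            excluded.extend(p for p in paths if p not in kept)
    return excluded


sources = [
    ("/preferred/path/a.jpg", "h1"),
    ("/backup/a_copy.jpg", "h1"),
    ("/backup/b.jpg", "h2"),
    ("/other/c.jpg", "h3"),
    ("/backup/c.jpg", "h3"),
]
excluded = pick_exclusions(sources, "/preferred/path")
print(excluded)
```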

How exclusions affect other commands:

Command            Default behavior                  Override
worklist           Skips excluded                    --include-excluded
facts              Skips excluded, shows count       --include-excluded
coverage           Stats on included only            --include-excluded shows excluded dimension
cluster generate   Always skips excluded             No override (hard gate)
apply              Blocks if manifest has excluded   No override (hard gate)

Exclusions are stored directly on sources and objects in the database.

Archiving

When you find a collection of files to archive, Canon uses a two-step process:

  1. Generate a manifest with cluster - select files and define the destination
  2. Apply the manifest with apply - copy or move files to the archive

This workflow lets you review and customize the output before committing to any file operations.

  • coverage - Check how much has been archived
  • cluster - Generate a manifest for a set of files
  • apply - Execute the manifest to copy/move files

canon coverage

Show archive coverage statistics - how many sources are hashed and how many are archived.

# Overview of all source roots
canon coverage

# Scoped to a specific directory
canon coverage /path/to/photos

# With filters
canon coverage --where 'source.ext=jpg'

# Coverage relative to a specific archive root
canon coverage --archive id:1
canon coverage --archive path:/path/to/archive

# Include archive roots in analysis
canon coverage --include-archived

Example output:

Archive Coverage Report

Root: /path/to/backup1 (source)
  Total sources:     1,234
  Hashed:            1,100 (89.1%)
  Archived:            850 (77.3% of hashed)
  Unarchived:          250

Root: /path/to/backup2 (source)
  Total sources:       567
  Hashed:              500 (88.2%)
  Archived:            400 (80.0% of hashed)
  Unarchived:          100

────────────────────────────────────────
Overall:
  Total sources:     1,801
  Hashed:            1,600 (88.8%)
  Archived:          1,250 (78.1% of hashed)
  Unarchived:          350

  • Hashed: Sources with a content hash (ready for archiving)
  • Archived: Sources whose content exists in an archive root
  • With --archive: Shows “In this archive” vs “Not in archive” for that specific archive
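
The percentages in the example report use different denominators: "Hashed" is a share of all sources, while "Archived" is a share of hashed sources. The arithmetic for the first root can be reproduced with a small sketch; the function name `coverage` is hypothetical.

```python
def coverage(total, hashed, archived):
    return {
        "hashed_pct": round(100 * hashed / total, 1),      # share of all sources
        "archived_pct": round(100 * archived / hashed, 1),  # share of hashed sources
        "unarchived": hashed - archived,
    }


# Figures for /path/to/backup1 from the example report.
stats = coverage(total=1234, hashed=1100, archived=850)
print(stats)
```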

canon cluster generate

Generate a manifest of files matching filters. The --dest flag specifies where files will be copied and must be inside a registered archive root.

# All photos to an archive (unhashed sources are automatically skipped)
canon cluster generate --where 'source.ext IN (jpg, png, heic)' --dest /Volumes/Archive/Photos

# Destination can be a subdirectory within an archive
canon cluster generate --where 'source.ext IN (jpg, png, heic)' --dest /Volumes/Archive/Photos/2024

# Scope to a specific path
canon cluster generate /path/to/photos --dest /Volumes/Archive

# Custom output file
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive -o my-manifest.toml

# Include sources from archive roots
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive --include-archived

# Show which files were left out of the manifest because they are already archived
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive --show-archived

# Overwrite existing manifest file
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive --force

The command generates two files: a manifest (.toml) that you edit, and a lock file (.lock) containing the source list.

Typical workflow:

canon cluster generate --where 'source.ext IN (jpg, png, heic)' --dest /Volumes/Archive
# Edit manifest.toml to customize the output pattern
canon apply manifest.toml --dry-run   # Preview
canon apply manifest.toml             # Execute

Manifest structure:

The generated manifest includes helpful comments listing all available pattern variables, modifiers, and aliases based on the facts present in your sources:

# Available facts for pattern (100% coverage on 1234 sources):
#
# Built-in:
#   filename           text   - Filename (last path component)
#   source.ext         text   - File extension
#   source.mtime       time   - Modification time
#   ...
#
# Content facts:
#   content.Make       text
#   content.Model      text
#   ...
#
# Modifiers:
#   Time: |year |month |day |date ...
#   String: |stem |ext |lowercase ...

[output]
pattern = "{filename}"           # ← Edit this to customize organization
base_dir = "/Volumes/Archive"
archive_root_id = 2

Common output patterns:

# Flat (default) - all files in base_dir
pattern = "{filename}"

# Preserve original folder structure (relocate as-is)
pattern = "{source.rel_path}"

# By EXIF date
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}"

# By EXIF date with hash prefix (avoids collisions)
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{hash_short}_{filename}"

# By camera model
pattern = "{content.Make}/{content.Model}/{filename}"

# By file type
pattern = "{source.ext}/{filename}"

See Pattern Expressions for the full syntax reference, including modifiers, path accessors, and aliases.

Refreshing the Lock File

Use canon cluster refresh to update the lock file if sources have changed since the manifest was generated:

# Re-query and update the lock file
canon cluster refresh manifest.toml

This re-runs the manifest’s query and updates manifest.lock with the current matching sources. The manifest settings remain unchanged.

canon apply

Apply a manifest to copy/move files. Copied files are automatically registered in the database with the same content hash, so they’re immediately recognized as archived (no separate scan needed).

# Preview what would happen (fast - skips source existence checks)
canon apply manifest.toml --dry-run

# Copy files (default mode, preserves mtime/permissions on Unix)
canon apply manifest.toml

# Show per-file progress during transfer
canon apply manifest.toml --verbose

# Rename files instead of copying (Unix only, fails on cross-device)
canon apply manifest.toml --rename

# Move files: rename if same device, copy+delete if cross-device
canon apply manifest.toml --move --yes

# Only apply sources from specific roots
canon apply manifest.toml --root id:1 --root id:2
canon apply manifest.toml --root path:/path/to/source

# Allow duplicates across archives (but not within destination)
canon apply manifest.toml --allow-cross-archive-duplicates

Transfer modes:

Flag        Behavior
(default)   Copy + preserve mtime/permissions (Unix)
--rename    Atomic rename; fails if cross-device (Unix only)
--move      Try rename; fallback to copy+delete on cross-device (Unix only, requires --yes)

All modes use noclobber semantics: if a destination file exists, apply aborts with an error.

Integrity validation:

During transfer, Canon validates each source file’s partial hash (first 8KB + last 8KB) to detect file corruption or modification since the manifest was generated. If validation fails, the transfer is aborted.
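
To make the idea concrete, here is a minimal sketch of a head-plus-tail partial hash. It is not Canon's implementation: the use of SHA-256, the way the two chunks are combined, and the handling of files smaller than 16KB are all assumptions, and `partial_hash` is a hypothetical name.

```python
import hashlib
import os

CHUNK = 8 * 1024  # 8KB, the chunk size described above


def partial_hash(path):
    """Hash the first 8KB and the last 8KB of a file (assumed scheme)."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        digest.update(f.read(CHUNK))
        if size > CHUNK:
            # For files under 16KB, start the tail where the head ended
            # so no byte is hashed twice.
            f.seek(max(size - CHUNK, CHUNK))
            digest.update(f.read(CHUNK))
    return digest.hexdigest()
```

A check of this kind is cheap because it reads at most 16KB per file, but it cannot notice a change confined to the middle of a large file.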

Root filtering:

Use --root to apply only a subset of sources from the manifest. Useful for staged application when sources are on different drives.

  • --root id:N - Filter by root ID (shown in manifest as root_id)
  • --root path:/path - Filter by root path (must match exactly)

Pre-flight checks (mandatory):

  1. Destination collisions - If multiple sources would map to the same destination path (e.g., using {filename} when sources have duplicate names), apply aborts with an error showing which files conflict.

  2. Archive conflicts - Checks if files already exist in the destination archive or other archives.

  3. Excluded sources - Blocks if any sources in the manifest are marked as excluded.
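
The destination-collision check can be pictured as grouping sources by their computed destination and flagging any group with more than one member. The sketch below assumes the default `{filename}` pattern and uses a hypothetical `find_collisions` helper; it illustrates the idea rather than Canon's code.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_collisions(source_paths):
    """Group sources by the destination that the pattern "{filename}" would give them."""
    by_dest = defaultdict(list)
    for src in source_paths:
        by_dest[PurePosixPath(src).name].append(src)
    # Only destinations claimed by more than one source are conflicts.
    return {dest: srcs for dest, srcs in by_dest.items() if len(srcs) > 1}


collisions = find_collisions([
    "backup1/vacation/IMG_001.jpg",
    "backup2/phone/IMG_001.jpg",
    "backup1/work/doc.pdf",
])
print(collisions)
```

Adding `{hash_short}` to the pattern, as in the common patterns shown earlier, is one way to keep such destinations distinct.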

Edit the manifest’s [output] section to customize the destination:

[output]
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}"
base_dir = "/path/to/archive"

Pattern variables use fact keys with optional modifiers (see Pattern Expressions for the full syntax):

  • {filename}, {stem}, {ext} - Filename aliases
  • {hash}, {hash_short} - Content hash aliases
  • {source.mtime|year}, {source.mtime|month} - File modification date
  • {content.DateTimeOriginal|year} - EXIF date with modifier
  • {content.Make}, {content.Model} - Any fact key

Facts Reference

Facts are key-value metadata. See Concepts: Facts for an overview.

Namespaces

Namespace   Description
source.*    Facts about the file on disk (path, size, mtime)
content.*   Facts about the content (hash, EXIF, mime type)
object.*    Object-level properties

The content. prefix is optional when querying. For example, Make=Apple is equivalent to content.Make=Apple.

Values

Facts can hold three value types:

Type        Examples                Notes
Text        "Apple", "image/jpeg"   Strings; quote if contains spaces
Number      1024, 3.14, -5          Integers or decimals
Timestamp   1704067200              Unix timestamps; enable date modifiers

Modifiers

Transform values using | syntax:

Time Modifiers

For timestamp values (like source.mtime or EXIF dates):

Modifier    Output                  Example
year        4-digit year            2024
month       2-digit month           07
day         2-digit day             23
hour        2-digit hour (24h)      14
minute      2-digit minute          30
second      2-digit second          45
date        ISO date                2024-07-23
time        ISO time                14:30:45
datetime    ISO datetime            2024-07-23T14:30:45
yearmonth   Year-month              2024-07
week        ISO week number         30
weekday     Day of week (Mon=1)     2
quarter     Quarter (1-4)           3
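
The example column can be reproduced from a single Unix timestamp. The sketch below is not Canon's code; it assumes timestamps are interpreted in UTC (Canon may use local time), and `time_modifiers` is a hypothetical helper covering a subset of the modifiers.

```python
from datetime import datetime, timezone


def time_modifiers(ts):
    """Derive a few modifier values from a Unix timestamp (UTC assumed)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    iso = dt.isocalendar()  # (ISO year, ISO week, ISO weekday with Mon=1)
    return {
        "year": dt.strftime("%Y"),
        "month": dt.strftime("%m"),
        "date": dt.strftime("%Y-%m-%d"),
        "time": dt.strftime("%H:%M:%S"),
        "datetime": dt.strftime("%Y-%m-%dT%H:%M:%S"),
        "yearmonth": dt.strftime("%Y-%m"),
        "week": str(iso[1]),
        "weekday": str(iso[2]),
        "quarter": str((dt.month - 1) // 3 + 1),
    }


ts = int(datetime(2024, 7, 23, 14, 30, 45, tzinfo=timezone.utc).timestamp())
values = time_modifiers(ts)
print(values)
```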

String Modifiers

Modifier     Description                  Example
lowercase    Convert to lowercase         JPG → jpg
uppercase    Convert to uppercase         jpg → JPG
capitalize   Capitalize first letter      apple → Apple
stem         Filename without extension   photo.jpg → photo
ext          File extension               photo.jpg → jpg
short        First 8 characters           abc123def456 → abc123de

Numeric Modifiers

Modifier        Description
bucket          Group into ranges (1-10, 10-100, etc.)
bucket(a,b,c)   Custom ranges (<a, a-b, b-c, >c)

Example: source.size|bucket groups file sizes into human-readable ranges.

Path Accessors

Python-style indexing for path values:

Syntax     Meaning
key[-1]    Last segment (filename)
key[0]     First segment
key[1:3]   Slice segments 1 and 2
key[:-1]   All but last segment

Accessors can be combined with modifiers:

source.rel_path[-1]        → IMG_001.jpg
source.rel_path[-1]|stem   → IMG_001
source.rel_path[0]         → photos

Pruning Facts

The canon prune command can delete facts to free database space.

Excluded Entity Facts

Delete facts for sources or objects you’ve excluded:

# Dry-run: show what would be deleted (default)
canon prune --excluded-facts

# Delete facts for both excluded sources and objects
canon prune --excluded-facts --yes

# Delete only source facts (excluded sources)
canon prune --excluded-facts=source --yes

# Delete only object facts (excluded objects)
canon prune --excluded-facts=object --yes

This is useful when you’ve excluded sources/objects you’re not interested in archiving and want to reclaim the database space used by their metadata.

Other Prune Options

Flag                 Description
--stale-facts        Delete source facts where the file changed since recording
--orphaned-objects   Delete objects with no present sources (and their facts)

All prune operations are dry-run by default. Add --yes to execute.

See Also

Built-in Facts Reference

These facts are automatically available for all sources without enrichment.

Source Facts

Fact              Type   Description
source.id         num    Database ID (hidden*)
source.ext        text   File extension (lowercase, no dot)
source.size       num    File size in bytes
source.mtime      time   Modification timestamp
source.path       path   Full absolute path
source.root       path   Root directory path (hidden)
source.rel_path   path   Path relative to root (hidden)
source.device     num    Device ID (hidden)
source.inode      num    Inode number (hidden)

Content Facts

Fact                  Type   Description
content.hash.sha256   text   SHA-256 content hash

Pattern Aliases

These aliases are available in pattern expressions:

Alias        Expands To
filename     source.rel_path[-1]
stem         source.rel_path[-1]|stem
ext          source.rel_path[-1]|ext
hash         content.hash.sha256
hash_short   content.hash.sha256|short
id           source.id

*Hidden facts are not shown in canon facts by default. Use --all to include them.

Filter Syntax

Filters select sources based on facts using a boolean expression language. Most commands accept --where to filter which sources they operate on. Multiple --where flags are combined with AND.

Operators

Basic

Syntax                     Meaning
key?                       Fact exists
key=value                  Fact equals value (case-sensitive)
key!=value                 Fact doesn’t equal value (case-sensitive)
key~pattern                Glob pattern match (case-sensitive)
key!~pattern               Glob pattern doesn’t match
key>value                  Greater than (numbers/dates)
key>=value                 Greater or equal
key<value                  Less than
key<=value                 Less or equal
key IN (v1, v2, ...)       Fact matches any value in list
key NOT IN (v1, v2, ...)   Fact doesn’t match any value in list

Glob Patterns

The ~ operator supports shell-style glob patterns:

Pattern   Meaning
*         Match zero or more characters
?         Match exactly one character
[abc]     Match any character in set
[a-z]     Match character range
[!abc]    Match any character NOT in set
\*        Literal asterisk (escape)

# Files starting with IMG_
--where 'filename~"IMG_*"'

# Files with 3-letter extension
--where 'source.ext~"???"'

# Files in a year subdirectory
--where 'source.rel_path~"*/2024/*"'

# Exclude temp files
--where 'filename!~"*.tmp"'
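
To experiment with these patterns outside Canon, Python's standard `fnmatch.fnmatchcase` offers similar case-sensitive glob semantics. It is only an approximation of the `~` operator: whether Canon's `*` also crosses `/` boundaries is an assumption here, and `fnmatch` does not support the `\*` escape (it uses `[*]` instead). The helper name `glob_match` is hypothetical.

```python
from fnmatch import fnmatchcase


def glob_match(value, pattern):
    """Case-sensitive glob match, standing in for the ~ operator."""
    return fnmatchcase(value, pattern)


results = [
    glob_match("IMG_001.jpg", "IMG_*"),                           # prefix match
    glob_match("jpg", "???"),                                     # exactly three characters
    glob_match("jpeg", "???"),                                    # four characters
    glob_match("photos/2024/vacation/IMG_001.jpg", "*/2024/*"),   # year subdirectory
    glob_match("notes.tmp", "*.tmp"),                             # what !~ would reject
    glob_match("a.jpg", "[!abc].jpg"),                            # negated set
]
print(results)
```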

Boolean Operators

Syntax          Meaning
expr AND expr   Both conditions must match
expr OR expr    Either condition matches
NOT expr        Negates the condition
(expr)          Grouping for precedence

Operator precedence (highest to lowest): NOT, AND, OR. Use parentheses to override.
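
Python's `not`, `and`, and `or` follow the same relative precedence, so the effect of omitting parentheses can be checked with an analogous expression. This is an analogy only; it does not exercise Canon's parser. The variables stand for a hypothetical small JPG file evaluated against `source.ext=jpg OR source.ext=png AND source.size>1000000`.

```python
ext, size = "jpg", 500

# Without parentheses, AND binds tighter than OR:
# ext == "jpg" or (ext == "png" and size > 1000000)
implicit = ext == "jpg" or ext == "png" and size > 1000000

# With parentheses, the size condition applies to both extensions.
explicit = (ext == "jpg" or ext == "png") and size > 1000000

print(implicit, explicit)
```

The unparenthesized form matches the small JPG, while the parenthesized form does not.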

Using Modifiers

Modifiers can be applied to fact keys using the | syntax. See Facts for the complete list.

# Files from 2024
--where 'source.mtime|year=2024'

# January photos
--where 'content.DateTimeOriginal|month=1'

# Case-insensitive extension matching
--where 'source.ext|lowercase=jpg'

# Case-insensitive glob
--where 'filename|lowercase~"img_*"'

Examples

# Files with a content hash
--where 'content.hash.sha256?'

# Files missing a content hash
--where 'NOT content.hash.sha256?'

# JPG files only
--where 'source.ext=jpg'

# JPG or PNG files
--where 'source.ext=jpg OR source.ext=png'

# Common image formats
--where 'source.ext IN (jpg, png, gif, webp)'

# Exclude certain extensions
--where 'source.ext NOT IN (tmp, bak, log)'

# Not temporary files
--where 'NOT source.ext=tmp'

# iPhone photos (content. prefix is optional)
--where 'Make=Apple'

# Files larger than 1MB
--where 'source.size>1000000'

# Files modified in 2024 or later
--where 'source.mtime>=2024-01-01'

# Large images (combining with parentheses)
--where '(source.ext=jpg OR source.ext=png) AND source.size>1000000'

# Multiple --where flags combine with AND
--where 'source.ext=jpg' --where 'content.Make=Apple'

Pattern Expressions

Pattern expressions define how files are organized in archives. They use {expr} syntax to insert dynamic values based on facts.

Patterns are used in the pattern field of cluster manifests. When you run canon cluster generate, it creates a manifest with a default pattern = "{filename}" that you can customize.

Basic Syntax

Patterns consist of literal path segments and expressions in curly braces:

{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}

This would produce paths like: 2024/07/IMG_001.jpg

Fact Keys

Any fact key can be used in a pattern:

  • {source.ext} - File extension
  • {source.mtime} - Modification time
  • {content.Make} - Camera manufacturer (from EXIF)
  • {content.hash.sha256} - Content hash

The content. prefix is optional for content facts, so {Make} is equivalent to {content.Make}.

Modifiers

Transform values using the | syntax. See Facts for the complete list.

{source.mtime|year}           → 2024
{source.mtime|yearmonth}      → 2024-07
{content.hash.sha256|short}   → a1b2c3d4
{source.ext|uppercase}        → JPG

Multiple modifiers can be chained:

{filename|stem|lowercase}     → img_001
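
The expansion of `{key|modifier|...}` expressions can be sketched with a small renderer. This is an illustration under assumptions, not Canon's implementation: `render` and `MODIFIERS` are hypothetical names, only a few string modifiers are covered, path accessors are omitted, and aliases such as `filename` and `hash` are assumed to be supplied already resolved in the `facts` dictionary.

```python
import re
from pathlib import PurePosixPath

# A subset of the string modifiers, applied left to right.
MODIFIERS = {
    "stem": lambda v: PurePosixPath(v).stem,
    "ext": lambda v: PurePosixPath(v).suffix.lstrip("."),
    "lowercase": str.lower,
    "uppercase": str.upper,
    "short": lambda v: v[:8],
}


def render(pattern, facts):
    """Replace each {key|mod|...} expression with its value from facts."""
    def expand(match):
        key, *mods = match.group(1).split("|")
        if key not in facts:
            # Missing facts stop the rendering instead of producing a partial path.
            raise KeyError(f"missing fact: {key}")
        value = str(facts[key])
        for mod in mods:
            value = MODIFIERS[mod](value)
        return value

    return re.sub(r"\{([^{}]+)\}", expand, pattern)


facts = {"filename": "IMG_001.JPG", "source.ext": "jpg", "hash": "a1b2c3d4e5f6"}
path = render(
    "{source.ext|uppercase}/{hash|short}_{filename|stem|lowercase}.{source.ext}",
    facts,
)
print(path)
```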

Path Accessors

Extract segments from path values using Python-style indexing:

Syntax     Meaning
key[-1]    Last segment (filename)
key[0]     First segment
key[1:3]   Slice segments 1 and 2
key[:-1]   All but last segment

Examples with source.rel_path = "photos/2024/vacation/IMG_001.jpg":

{source.rel_path[-1]}         → IMG_001.jpg
{source.rel_path[0]}          → photos
{source.rel_path[1:-1]}       → 2024/vacation
{source.rel_path[-1]|stem}    → IMG_001
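
Since the accessors follow Python's indexing rules, the same results can be reproduced by splitting the path on `/` and indexing the resulting list. The helper `access` is hypothetical, and rejoining sliced segments with `/` is an assumption based on the `[1:-1]` example above.

```python
def access(path, index):
    """Apply an integer index or a slice to the path's segments."""
    segments = path.split("/")
    picked = segments[index]
    # A slice yields several segments, which are rejoined into a sub-path.
    return "/".join(picked) if isinstance(index, slice) else picked


rel_path = "photos/2024/vacation/IMG_001.jpg"
last = access(rel_path, -1)                  # [-1]
first = access(rel_path, 0)                  # [0]
middle = access(rel_path, slice(1, -1))      # [1:-1]
parent = access(rel_path, slice(None, -1))   # [:-1]
print(last, first, middle, parent)
```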

Aliases

Aliases provide shorthand for common expressions. Use canon facts --show-aliases to see all available aliases.

Alias        Expands To
filename     source.rel_path[-1]
stem         source.rel_path[-1]|stem
ext          source.rel_path[-1]|ext
hash         content.hash.sha256
hash_short   content.hash.sha256|short
id           source.id

Example using aliases:

{hash_short}_{filename}       → a1b2c3d4_IMG_001.jpg

Missing Values

Canon requires all facts used in a pattern to have values for every source. If any source is missing a required fact, canon apply will refuse to proceed and report which facts are missing.

When you run canon cluster generate, the manifest includes comments listing all facts with 100% coverage. These are safe to use in your pattern.

If sources are missing required facts, you can:

  • Filter them out during generation: --where 'DateTimeOriginal?'
  • Import the missing facts via the enrichment pipeline

Common Patterns

# Flat (all files in one directory)
pattern = "{filename}"

# Preserve original structure
pattern = "{source.rel_path}"

# By EXIF capture date
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}"

# By date with hash prefix (collision-safe)
pattern = "{content.DateTimeOriginal|date}/{hash_short}_{filename}"

# By camera
pattern = "{content.Make}/{content.Model}/{filename}"

# By file type and year
pattern = "{source.ext}/{source.mtime|year}/{filename}"