Introduction
Canon is a CLI tool for organizing large collections of files — photos, music, documents — scattered across old hard drives, backup folders, cloud downloads, and phone exports. It indexes files across any number of locations, identifies duplicates by content hash, and lets you query and filter everything with metadata. When you’re ready, it safely archives what matters to an organized destination.
The Problem
Files accumulate over years and across devices. Backup drives pile up. You know there are things worth keeping in there, but the scale makes it hard to even start. Manual approaches are risky — one wrong move and something irreplaceable could be gone. So the drives keep sitting in drawers.
The Approach
Canon takes a methodical, incremental approach:
- Scan directories to index files and compute content hashes
- Enrich with metadata extracted by external tools (EXIF, file types, etc.)
- Explore what you have using filters and queries
- Archive selected files to a canonical location, at your own pace
Each step is revisitable. You can scan new drives, add more metadata, refine your queries, and archive in small batches. Canon tracks what’s already archived, so you always know your progress.
Canon never modifies or moves your source files. Every operation that changes state offers a dry run, a preview, and a confirmation step, so you can point Canon at a drive and explore freely without risk.
Key Features
- Content-based deduplication: Files are identified by their content hash, not by name or location — the same photo in three backup folders is recognized as one thing
- Flexible metadata: Import any key-value facts from external tools (EXIF data, MIME types, geolocation, or anything you want)
- Powerful filtering: Query by any combination of facts using boolean expressions and aliases
- Safe archiving: Preview operations with --dry-run, validate integrity during transfer, and track what’s been archived
- Incremental workflow: Work at your own pace: scan a drive today, enrich it next week, archive a batch next month
Ready to get started? See Setup and Getting Started.
Setup
Installation
Install Canon from crates.io:
cargo install canon-archive
This installs the canon binary.
From Source
Alternatively, build from source:
git clone https://github.com/robklg/canon.git
cd canon
cargo install --path .
Canon Home Directory
Canon stores all state in a single directory called the canon home. The default location is ~/.canon/.
It contains:
| File | Purpose |
|---|---|
| canon.db | SQLite database (roots, sources, objects, facts) |
| aliases.toml | Filter aliases (optional; see Aliases) |
The directory is created automatically on first use.
Overriding the Location
You can relocate canon home with the CANON_HOME environment variable or the --canon-home flag:
# Via environment variable
export CANON_HOME=/mnt/archive/.canon
canon scan /photos
# Via flag (takes precedence over environment variable)
canon --canon-home /tmp/test-canon scan /photos
Precedence: --canon-home flag > CANON_HOME env var > ~/.canon/
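The precedence rule above can be sketched in a few lines of Python. This is only an illustration of the lookup order, not Canon's implementation; the function name resolve_canon_home is hypothetical, and treating an empty CANON_HOME as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_canon_home(flag=None, env=None):
    """Return the canon home directory using the documented precedence.

    `flag` stands in for the value of --canon-home (None when not given).
    `env` defaults to the process environment; pass a dict to experiment.
    """
    env = os.environ if env is None else env
    if flag:                         # 1. --canon-home flag wins
        return Path(flag)
    if env.get("CANON_HOME"):        # 2. then the CANON_HOME variable
        return Path(env["CANON_HOME"])
    return Path.home() / ".canon"    # 3. otherwise the default location


print(resolve_canon_home("/tmp/test-canon", {"CANON_HOME": "/mnt/archive/.canon"}))
print(resolve_canon_home(None, {"CANON_HOME": "/mnt/archive/.canon"}))
```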
Verify Installation
canon --help
You should see the list of available commands. You’re ready to start scanning your files.
Getting Started
This guide walks through a typical Canon workflow: scanning files, enriching with metadata, querying, and archiving.
Scanning
First, index your source files and existing archive:
# Add source roots (files you want to organize)
canon scan --add --role source /path/to/photos
canon scan --add --role source /path/to/backup-drive/photos
canon scan --add --role source --comment "Old backup, possibly duplicates" /Volumes/OldDrive
# Add an archive root (your organized destination)
canon scan --add --role archive /Volumes/Archive
By default, Canon computes content hashes during scanning. This enables deduplication and archive tracking.
Enriching
Use external tools to extract metadata. The example below uses exiftool to extract EXIF data including GPS-based geolocation:
canon worklist --where 'source.ext|lowercase IN (jpg, jpeg, heic, mov, mp4)' \
| ./scripts/exif-worklist.sh \
| canon import-facts
See Enriching for details on the worklist/import pipeline.
Querying
Discover what facts are available and explore your files:
# See all available facts
canon facts
# Check value distribution for a specific fact
canon facts --key content.geo.region # Where were photos taken?
canon facts --key "content.DateTimeOriginal|year" # Which years?
# List files matching filters
canon ls --where 'content.geo.city=Bletchley'
# Preview files (macOS)
canon ls -0 --where 'content.geo.city=Bletchley' | xargs -0 open -a Preview
Archiving
When you find a collection worth archiving, create a manifest:
canon cluster generate \
--where 'content.DateTimeOriginal|year=2023' \
--where 'content.geo.region="North Holland"' \
--dest /Volumes/Archive/Trips/2023-Amsterdam
This creates manifest.toml with the query parameters and a manifest.lock with matching sources.
Edit manifest.toml to customize the output pattern:
[output]
pattern = "{content.DateTimeOriginal|date}/{filename}"
base_dir = "/Volumes/Archive/Trips/2023-Amsterdam"
Preview and apply:
canon apply manifest.toml --dry-run # Preview what will happen
canon apply manifest.toml # Execute the copy
Files are copied to the archive with paths like:
/Volumes/Archive/Trips/2023-Amsterdam/2023-06-16/IMG_001.jpg
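To make the output pattern concrete, here is a rough Python sketch of how a pattern such as the one above could expand into a relative path. It is an assumption-laden illustration: render_pattern is a hypothetical helper, only the |date modifier and {filename} are handled, and timestamps are assumed to be Unix seconds rendered in UTC. Canon's real pattern engine may behave differently.

```python
import re
from datetime import datetime, timezone


def render_pattern(pattern, facts, filename):
    """Expand {key} and {key|date} placeholders (hypothetical helper)."""
    def expand(match):
        key, _, modifier = match.group(1).partition("|")
        if key == "filename":
            return filename
        value = facts[key]
        if modifier == "date":
            # Assumption: the fact is a Unix timestamp, shown as a UTC date.
            return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")
        return str(value)

    return re.sub(r"\{([^{}]+)\}", expand, pattern)


taken = int(datetime(2023, 6, 16, 12, 0, tzinfo=timezone.utc).timestamp())
relative_path = render_pattern(
    "{content.DateTimeOriginal|date}/{filename}",
    {"content.DateTimeOriginal": taken},
    "IMG_001.jpg",
)
print(relative_path)
```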
Next Steps
- Learn about Concepts to understand how Canon models your files
- Explore the full Commands reference
- See Filters for advanced query syntax
Concepts
Understanding these core concepts will help you use Canon effectively.
- Roots: Storage locations that Canon tracks
- Sources: Files discovered on disk
- Objects: Unique content identified by hash
- Sources vs. Objects: How files relate to content
- Facts: Metadata attached to sources or objects
Roots
A root is a directory on a storage device that Canon tracks. Each root is identified by its absolute path and assigned a role.
Roles
Canon distinguishes two root roles:
Source roots contain assets you want to explore, reconcile, or archive. They may be unstructured, incomplete, or contain duplicates. Examples: old backup drives, phone exports, download folders.
Archive roots hold an intentional structure that you maintain. Files archived by Canon are placed here. Examples: your organized photo library, music collection, document archive.
Rules
- Roots may not overlap (one root cannot be inside another)
- A root can be any directory, not just a drive or mount point
- You can have multiple roots of each type
- Roots can be suspended to temporarily hide them from operations
Typical Setup
Source roots:
  /Volumes/OldBackup       (unorganized photos from 2015)
  /Volumes/PhoneExport     (recent phone backup)
  ~/Downloads/Photos       (miscellaneous downloads)
Archive roots:
  /Volumes/Archive/Photos  (canonical photo library)
  /Volumes/Archive/Music   (canonical music library)
Offline Access
Query commands (ls, facts, coverage, worklist, compare, cluster generate, exclude, roots) work even when the underlying storage is detached. Canon resolves path arguments against known roots in the database, so you can explore sources, check coverage, and generate manifests without the storage being physically attached.
Commands that access file contents (scan, apply) still require the storage to be online.
Source
A source is a file discovered on disk during scanning. Canon tracks:
- Location: Root path + relative path within the root
- Identity: Device ID and inode for move detection
- Metadata: Size and modification time
- Integrity: Partial hash (first + last 8KB) for validation during transfers
- State: A basis_rev counter that increments when size or mtime changes
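As an illustration of the partial hash mentioned in the Integrity bullet, the Python sketch below hashes the first and last 8 KB of a file. How Canon actually combines the two chunks, and how it treats files smaller than 16 KB, is not stated here, so those details are assumptions.

```python
import hashlib
import os
import tempfile

CHUNK = 8 * 1024  # 8 KB, matching the "first + last 8KB" description


def partial_hash(path):
    """SHA-256 over the first and last 8 KB of a file (illustrative only)."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        digest.update(f.read(CHUNK))      # head
        f.seek(max(0, size - CHUNK))
        digest.update(f.read(CHUNK))      # tail (overlaps the head in small files)
    return digest.hexdigest()


# Two files that differ only in the middle share a partial hash, which is
# why a cheap partial hash suits transfer checks but not content identity.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, middle in (("a.bin", b"x" * 100), ("b.bin", b"y" * 100)):
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(b"H" * CHUNK + middle + b"T" * CHUNK)
        paths.append(path)
    same_edges = partial_hash(paths[0]) == partial_hash(paths[1])

print(same_edges)
```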
Sources represent where files are found. Multiple sources can point to the same content (see Object) when files are duplicated across locations.
When a source is scanned with hashing enabled (the default), Canon computes its SHA-256 hash and links it to an object. This enables deduplication and archive tracking.
Exclusion
Sources can be marked as excluded to skip them during archiving. A source is considered excluded if:
- The source itself is marked excluded, OR
- The source’s linked object is marked excluded
This two-level check means that excluding an object effectively excludes all sources with that content. Object-level exclusion is useful when you want to skip content regardless of where it appears.
Object
An object represents unique content identified by its SHA-256 hash. Objects are content-addressed: two files with identical bytes will have the same hash and thus reference the same object.
Objects enable:
- Deduplication: Multiple sources can point to the same object
- Archive tracking: When content exists in an archive, all sources with that hash are marked as archived
- Fact sharing: Metadata attached to an object is available on all sources with that content
Objects are created automatically when sources are hashed during scanning or enrichment.
Source vs. Object
Understanding the relationship between sources and objects is key to how Canon handles deduplication and archive tracking.
Sources Are Locations
When a root is scanned, Canon indexes every file it finds as a source. Each source represents a specific file at a specific path.
Objects Are Content
When sources are hashed, Canon creates or links them to objects. An object represents the underlying content, independent of where it was found.
Source A: /backup1/photos/IMG_001.jpg ─┐
Source B: /backup2/old/IMG_001.jpg    ─┼─► Object (hash: abc123...)
Source C: /downloads/photo.jpg        ─┘
All three sources above have identical content, so they reference the same object.
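The same relationship can be expressed as a grouping of paths by content hash. The following Python sketch uses in-memory byte strings in place of real files; group_by_object is a hypothetical helper, not a Canon API.

```python
import hashlib
from collections import defaultdict


def sha256_of(data):
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def group_by_object(sources):
    """Map content hash -> list of source paths that share that content."""
    objects = defaultdict(list)
    for path, data in sources.items():
        objects[sha256_of(data)].append(path)
    return dict(objects)


sources = {
    "/backup1/photos/IMG_001.jpg": b"same bytes",
    "/backup2/old/IMG_001.jpg": b"same bytes",
    "/downloads/photo.jpg": b"same bytes",
    "/downloads/other.jpg": b"different bytes",
}
objects = group_by_object(sources)
for digest, paths in objects.items():
    print(digest[:12], paths)
```

Four sources collapse into two objects: file names and locations play no part in the grouping.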
Fact Sharing
When a source is linked to an object:
- Content facts (like EXIF metadata) can be stored on the object and become available to all sources with that hash
- Source facts (like file path) remain specific to each source
This allows metadata to flow between different copies of the same content. Import a fact once, and it’s available everywhere that content exists.
Archive Tracking
Canon uses the source-object relationship to track archiving progress:
- When you archive a file, Canon copies it to an archive root and records the object’s hash
- Any source with that same hash is now considered “archived”
- The coverage command shows how many of your sources exist in an archive
Hashing
By default, Canon hashes all files during scanning. Since hashing can be time-consuming for large collections, you can:
- Use --no-hash during scan to skip hashing initially
- Hash selectively via the enrichment pipeline, targeting specific file types
Unhashed sources cannot be linked to objects, so they cannot be deduplicated or tracked for archive coverage.
Facts
Facts are key-value metadata attached to sources or objects.
Types of Facts
Built-in facts are collected automatically during scanning:
- source.ext - File extension
- source.size - File size in bytes
- source.mtime - Modification timestamp
- content.hash.sha256 - Content hash (when computed)
Imported facts come from external tools via the enrichment pipeline:
- EXIF metadata: content.Make, content.Model, content.DateTimeOriginal
- Geolocation: content.geo.city, content.geo.country
- Media info: content.mime, content.duration
- Any custom key-value pairs you choose to import
Namespaces
Facts are namespaced:
- source.* - Facts about the file on disk (path, size, timestamps)
- content.* - Facts about the content itself (stored on objects when hashed)
When querying, the content. prefix is optional: --where 'Make=Apple' is equivalent to --where 'content.Make=Apple'.
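One way to picture the optional prefix is as a normalization step applied to each key before lookup. The sketch below is an assumption about the behavior, not Canon's code; normalize_key is a hypothetical name.

```python
def normalize_key(key):
    """Add the content. namespace unless the key is already namespaced."""
    if key.startswith(("source.", "content.")):
        return key
    return "content." + key


for key in ("Make", "content.Make", "source.ext", "geo.city"):
    print(key, "->", normalize_key(key))
```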
Value Types
Canon stores facts as:
- Text: Strings like "Apple" or "image/jpeg"
- Numbers: Integers or decimals like 1024 or 3.14
- Timestamps: Unix timestamps, enabling date modifiers like |year and |month
Type hints can be provided during import to ensure correct parsing. See Enriching for details.
Canon Commands
Common Options
Most commands that operate on sources share these options:
Path scope — Limit a command to a specific directory by passing a path:
canon ls /path/to/photos
canon facts /path/to/photos
canon coverage /path/to/photos
Filters — Select sources using --where with boolean expressions:
canon ls --where 'source.ext=jpg'
canon facts --where 'source.size > 1000000'
canon cluster generate --where 'geo.country=Netherlands' --dest /archive
Multiple --where flags are combined with AND. See Filters for the full syntax.
--include — By default, query commands (ls, facts, coverage, worklist, compare) show sources from active source roots, hiding excluded and archived sources. Use --include to expand what you see:
canon ls --include excluded # Also show excluded sources
canon ls --include archived # Also show sources from archive roots
canon facts --include all # Show everything
This is always safe — --include only changes what’s displayed, never modifies anything.
--allow — Commands that change state (cluster generate, apply, import-facts) skip certain sources by default (e.g., sources already in an archive). Use --allow to acknowledge you want to include them:
canon cluster generate --allow archived # Include sources from archive roots
canon cluster generate --allow duplicates # Include content already archived elsewhere
canon import-facts --allow archived # Import facts for archive sources
The available --allow values are specific to each command. See individual command pages for details.
Command Reference
- Managing Roots: Add and manage storage locations
- Enriching: Import metadata from external tools
- worklist: Output sources for external processing
- import-facts: Import processor output
- Writing Processors: Build custom extractors
- Querying: Explore your indexed files
- Managing Sources: Control which sources are processed
- exclude: Mark sources to skip during archiving
- Archiving: Organize files into your canonical archive
- Maintenance: Clean up and maintain the database
- facts delete: Remove incorrect or unwanted metadata
- prune: Clean up stale, orphaned, or excluded data
Managing Roots
To track files in Canon, you first add and scan roots. This makes their sources available for enrichment and archive operations. You can suspend roots to temporarily hide them from Canon commands.
Adding new roots and re-scanning existing ones is done with the scan command.
Managing roots, such as listing or suspending them, is done with canon roots.
Scan
Scan directories and index files.
When you scan a particular root, Canon will walk the directory tree starting at the given path(s). For each file, basic metadata such as last modification time and size is collected, and (by default) the hash is computed. After scanning, Canon knows about the existence of all sources in that root. If the files were hashed they will be linked to objects.
Hashing can take a long time on large collections, so you can skip it with --no-hash.
Skipping the hash makes sense if you intend to hash selectively, for instance when you’re only interested in certain types of files.
There is no real limit on how many roots you can add. It can help to scan collections of files that belong together as separate roots. Each root can carry a comment, which helps you recall what it contains and gives you a place to keep notes about what you discovered there.
If you have an already organized location that you want Canon to treat as your canonical archive, scan it with --role archive from the start. The role is set when the root is added; to change it, you must remove the root and re-add it with the new role.
You can add multiple archive roots, for instance one for your music collection and another for your eBooks.
When to run scan
If your filesystem changes regularly, re-scan your roots so that Canon can detect the changes and you don’t miss files when archiving. Note that, when archiving, Canon always checks the validity of the files to be archived.
Another use case is periodic integrity verification of your archives. Use --verify to recompute hashes for all files and detect corruption. Canon exits with a non-zero status if any mismatches are found, making it suitable for cron jobs that alert on failure.
Examples
# Add a new root and scan it (--add and --role required for new roots)
canon scan --add --role source /path/to/photos
# Scan multiple new roots
canon scan --add --role source /path/to/photos /path/to/more/photos
# Add with a descriptive comment
canon scan --add --role source --comment "Photos from 2020 trip" /path/to/photos
# Add as an archive root (for tracking already-organized files)
canon scan --add --role archive /path/to/archive
# Re-scan an existing root (--role optional, validated against existing)
canon scan /path/to/photos
# Scan just a subtree within an existing root
canon scan /path/to/photos/2024
# Scan without computing hashes (just index files)
canon scan --no-hash /path/to/photos
# Verify archive integrity by recomputing all hashes (good for cron jobs)
canon scan --verify /Volumes/Archive
# Mark sources under a deleted folder as not present
canon scan --missing /path/to/deleted/folder
Hash computation: By default, Canon computes content hashes for new and changed files during scan. This enables deduplication and archive tracking. Use --no-hash to skip hashing if you just want to index files quickly.
Integrity verification: Use --verify to recompute hashes for all files, even unchanged ones. Run periodically (e.g., via cron) to detect file corruption. If a file’s hash changes without its mtime changing, Canon warns about possible corruption and exits with an error.
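The corruption rule described above (hash changed, mtime did not) can be sketched as a small decision function. The names Snapshot, classify, and exit_status are hypothetical, and the labels are illustrative rather than Canon's actual output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    mtime: int     # modification time, Unix seconds
    sha256: str    # content hash


def classify(recorded, observed):
    """Compare the recorded state of a file with a fresh re-hash."""
    if observed.sha256 == recorded.sha256:
        return "ok"
    if observed.mtime != recorded.mtime:
        return "changed"              # an ordinary edit: mtime moved too
    return "possible corruption"      # content changed behind an unchanged mtime


def exit_status(results):
    """Non-zero when any mismatch was found, so cron can alert on failure."""
    return 1 if "possible corruption" in results else 0


recorded = Snapshot(mtime=1703980800, sha256="abc123")
results = [
    classify(recorded, Snapshot(1703980800, "abc123")),
    classify(recorded, Snapshot(1704067200, "def456")),
    classify(recorded, Snapshot(1703980800, "def456")),
]
print(results, exit_status(results))
```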
Discovering untracked directories: Use --candidates to find directories with files that aren’t yet under any root. This is useful when exploring a drive or backup to see what could be added:
# Find candidate roots to add under a path
canon scan --candidates /Volumes/Backup
# Output shows directories with untracked files
Candidate roots to add:
/Volumes/Backup/photos (3 directories with files)
/Volumes/Backup/imports (1 directory with files)
Directories under existing roots are skipped. When multiple subdirectories share a common ancestor that could be added as a single root, they’re rolled up (unless that ancestor contains an existing root).
Marking deleted paths as missing: When you delete a folder that was under a scanned root, Canon still thinks those files are present. Normally you’d re-scan the parent to let Canon discover they’re gone, but that can be expensive if the parent contains many other files. Use --missing to tell Canon directly that a path no longer exists:
# Deleted a backup folder — mark its 140 sources as not present
canon scan --missing /Volumes/share/Backup/old-phone
# Works with any path under a known root, including the root itself
canon scan --missing /Volumes/share/Backup
The sources are marked as not present but remain in the database with their hashes and metadata intact. If the path reappears later (e.g., storage remounted), a normal scan will reconcile them. The --missing flag cannot be combined with --all or --add.
Output shows what was found:
Scanned 1234 files: 100 new, 5 updated, 2 moved, 1127 unchanged, 0 missing
Hashed 105 files
canon roots
List and manage registered roots.
Roots are added via scan and managed with the roots command. You can list, suspend/unsuspend, add comments, or remove roots.
Important notes:
- Removing a root also removes its sources and attached facts from the database
- Removing a root does not delete any files on disk
- If you re-add a removed root, you’ll need to re-enrich it
# List all roots with file counts and last scan time
canon roots
# List roots at or beneath a specific path
canon roots /path/to/photos
# List only suspended roots
canon roots --suspended
# Set a comment on a root (omit text to clear)
canon roots comment id:1 "Old backup, possibly duplicates"
canon roots comment id:1
# Suspend a root (hides from all operations without deleting data)
canon roots suspend id:1
canon roots suspend path:/path/to/photos
# Unsuspend a root (make visible again)
canon roots unsuspend id:1
# Remove a root by ID (files on disk are NOT deleted)
canon roots rm id:1
# Remove a root by path
canon roots rm path:/path/to/photos
# Skip confirmation prompt
canon roots rm id:1 --yes
Example output:
ID  ROLE     FILES   LAST SCAN  PATH
1   source   16635   2h ago     /path/to/photos
2   archive  169941  5d ago     /path/to/archive
3   source   1234    never      /path/to/backup (Old backup, possibly duplicates)
Suspending Roots
Suspended roots are hidden from listings, excluded from scan --all, and their sources are excluded from all queries (ls, facts, coverage, worklist, etc.). Suspended roots still prevent overlapping (you cannot add a new root at a suspended root’s path). Use --suspended to list only suspended roots.
Removing Roots
When removing a root, Canon shows how many sources are “in archive” (same content exists in an archive) vs “not in archive”, and suggests using canon ls <path> to preview which sources will be forgotten.
Root Specs
Several commands accept root specifications in two formats:
| Format | Example | Description |
|---|---|---|
| id:N | id:1 | By database ID (shown in canon roots output) |
| path:/... | path:/path/to/photos | By exact path |
canon roots suspend id:1
canon roots suspend path:/path/to/photos
Enriching
Add metadata to indexed files using external processors.
Canon uses a pipeline model: worklist outputs sources as JSONL, an external processor extracts metadata, then import-facts stores the results.
canon worklist → processor → canon import-facts
A processor can be any CLI tool or script that extracts information from files: exiftool for EXIF data, file for MIME types, ffprobe for media info, or custom scripts you write yourself.
Basic Usage
Extract EXIF metadata from images:
canon worklist --where 'source.ext|lowercase IN (jpg, jpeg, heic)' \
| ./scripts/exif-worklist.sh \
| canon import-facts
Note the --where filter: it’s usually smart to limit the worklist to files the processor can actually handle.
Detect MIME types for all files:
canon worklist | canonargs --fact mime -- file -b --mime-type {} | canon import-facts
After enrichment, the imported facts become available for filtering and querying.
Provided Processors
Canon includes ready-to-use processors:
| Processor | Purpose | Requires |
|---|---|---|
| scripts/exif-worklist.sh | EXIF, GPS, and media metadata | exiftool, jq |
| scripts/hash-worklist.sh | SHA-256 content hashes | jq |
| canonargs --fact mime -- file -b --mime-type {} | MIME type detection | canonargs |
Install canonargs with: cargo install canonargs
Going Deeper
- worklist - Full options for generating worklists
- import-facts - Input format and type hints
- Writing Processors - Build your own enrichment scripts
Tip: Selective Hashing
Content hashing normally happens during scan. If you prefer to hash only specific file types, use --no-hash during scan and hash selectively via the pipeline:
canon scan --no-hash --add --role source /path/to/mixed-files
canon worklist --where 'mime~image/* OR mime~video/*' \
| ./scripts/hash-worklist.sh \
| canon import-facts
canon worklist
Output sources as JSONL for processing by external tools.
# Sources in current directory (when inside a root)
canon worklist
# All sources across all roots
canon worklist --global
# Only sources missing a content hash
canon worklist --where 'NOT content.hash.sha256?'
# Only JPG files
canon worklist --where 'source.ext=jpg'
# Scope to a specific directory
canon worklist /path/to/photos
# Include sources from archive roots (for backfilling facts)
canon worklist --include archived
# Include excluded sources
canon worklist --include excluded
# Include both
canon worklist --include all
# Include existing facts in output (for chained enrichment)
canon worklist --emit content.geo.lat --emit content.geo.lon
Output Format
Each line is a JSON object with source metadata:
{"source_id":123,"path":"/full/path/to/file.jpg","root_id":1,"size":1024,"mtime":1703980800,"basis_rev":0}
| Field | Description |
|---|---|
| source_id | Database ID (pass through to import-facts) |
| path | Full absolute path to the file |
| root_id | ID of the root containing this source |
| size | File size in bytes |
| mtime | Modification time (Unix timestamp) |
| basis_rev | Revision counter for staleness detection |
Emitting Existing Facts
With --emit, requested facts are included in the output (null if absent):
canon worklist --emit geo.lat --emit geo.lon
{"source_id":123,"path":"/...","basis_rev":0,"facts":{"geo.lat":52.37,"geo.lon":4.89}}
{"source_id":124,"path":"/...","basis_rev":0,"facts":{"geo.lat":null,"geo.lon":null}}
This enables processors to build on previous enrichment:
- Dependent enrichment: Use extracted coordinates to look up location names
- Fact combination: Merge data from multiple sources into derived facts
Example: reverse geocoding files that have coordinates but no city name:
canon worklist --emit geo.lat --emit geo.lon --where 'geo.lat? AND NOT geo.city?' \
| ./scripts/reverse-geocode.sh \
| canon import-facts
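The contents of scripts/reverse-geocode.sh are not shown here, so the following Python sketch only illustrates the shape of such a dependent processor. It reads worklist lines that carry emitted geo.lat and geo.lon facts and writes a geo.city fact. The two-entry GAZETTEER and the naive nearest-point lookup are toy assumptions; a real processor would query a proper geocoding source.

```python
import json

# Toy lookup table (assumption): city name -> (latitude, longitude)
GAZETTEER = {"Amsterdam": (52.37, 4.90), "Bletchley": (51.99, -0.73)}


def nearest_city(lat, lon):
    """Closest gazetteer entry by squared distance in degrees (rough)."""
    return min(
        GAZETTEER,
        key=lambda city: (GAZETTEER[city][0] - lat) ** 2 + (GAZETTEER[city][1] - lon) ** 2,
    )


def geocode(lines):
    """Yield import-facts records for entries that have coordinates."""
    for line in lines:
        entry = json.loads(line)
        facts = entry.get("facts") or {}
        lat, lon = facts.get("geo.lat"), facts.get("geo.lon")
        if lat is None or lon is None:
            continue  # nothing to build on; skip this source
        yield json.dumps({
            "source_id": entry["source_id"],
            "basis_rev": entry["basis_rev"],   # passed through unchanged
            "facts": {"geo.city": nearest_city(lat, lon)},
        })


sample = [
    '{"source_id":123,"path":"/...","basis_rev":0,"facts":{"geo.lat":52.37,"geo.lon":4.89}}',
    '{"source_id":124,"path":"/...","basis_rev":0,"facts":{"geo.lat":null,"geo.lon":null}}',
]
records = list(geocode(sample))
for record in records:
    print(record)
```

In a real pipeline the function would be fed sys.stdin instead of the sample list.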
Staleness Detection
The worklist is a snapshot of sources at a point in time. Each entry includes basis_rev which tracks file changes. Processors should pass this through to import-facts, which will skip the import if the file changed since the worklist was generated.
The size and mtime fields allow processors to verify a file hasn’t changed before extracting facts.
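A processor might use those two fields as follows. This Python sketch compares a worklist entry against the file's current state; still_matches is a hypothetical helper, and comparing mtime at whole-second resolution is an assumption based on the integer timestamp in the sample output.

```python
import os
import tempfile


def still_matches(entry):
    """True if the file on disk still has the size and mtime from the worklist."""
    try:
        st = os.stat(entry["path"])
    except OSError:
        return False  # file vanished or is unreadable
    return st.st_size == entry["size"] and int(st.st_mtime) == entry["mtime"]


# Demonstration with a temporary file standing in for a scanned source.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"original")
st = os.stat(tmp.name)
entry = {"source_id": 1, "path": tmp.name, "size": st.st_size,
         "mtime": int(st.st_mtime), "basis_rev": 0}

before = still_matches(entry)        # file untouched since the "worklist"
with open(tmp.name, "ab") as f:
    f.write(b" plus more")           # size changes
after = still_matches(entry)
os.unlink(tmp.name)
print(before, after)
```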
canon import-facts
Import facts from JSONL on stdin. Designed to receive output from a processor that consumed a worklist.
canon worklist | some-processor | canon import-facts
# Allow importing facts for sources in archive roots
canon worklist --include archived | some-processor | canon import-facts --allow archived
Input Format
Each line must be a JSON object with source_id, basis_rev, and facts:
{"source_id":123,"basis_rev":0,"facts":{"hash.sha256":"abc123...","mime":"image/jpeg"}}
| Field | Description |
|---|---|
| source_id | Source ID from the worklist (required) |
| basis_rev | Revision from the worklist for staleness check (required) |
| facts | Object mapping fact keys to values |
The processor must pass through source_id and basis_rev from the worklist entry. If basis_rev doesn’t match the source’s current value, the import is skipped (the file changed since the worklist was generated).
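The staleness check can be modeled in a few lines. The Python sketch below is a simplified stand-in for what import-facts is described as doing: current_revs and store are hypothetical in-memory substitutes for the database, and fact namespacing is left out.

```python
import json


def import_facts(lines, current_revs, store):
    """Apply records whose basis_rev still matches; skip the rest.

    current_revs: source_id -> current basis_rev (stand-in for the database)
    store:        source_id -> dict of facts, updated in place
    """
    imported = skipped = 0
    for line in lines:
        record = json.loads(line)
        source_id = record["source_id"]
        if current_revs.get(source_id) != record["basis_rev"]:
            skipped += 1          # file changed since the worklist was generated
            continue
        store.setdefault(source_id, {}).update(record["facts"])
        imported += 1
    return imported, skipped


lines = [
    '{"source_id":123,"basis_rev":0,"facts":{"mime":"image/jpeg"}}',
    '{"source_id":124,"basis_rev":0,"facts":{"mime":"image/png"}}',
]
store = {}
# Source 124 was modified after the worklist was produced, so its revision is now 1.
result = import_facts(lines, {123: 0, 124: 1}, store)
print(result, store)
```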
Fact Namespacing
Facts are automatically namespaced under content.*. For example, mime becomes content.mime.
The special key hash.sha256 creates or links an object, enabling deduplication and archive tracking.
Type Hints
Types matter. Canon stores facts as text, numbers, or timestamps. The type determines what operations work on a fact:
- Timestamps enable date modifiers (|year, |month, |date) and date comparisons (>=2024-01-01)
- Numbers enable numeric comparisons (>1000, <=5.0) and the |bucket modifier
- Text enables string matching (=, ~glob) and string modifiers (|lowercase, |stem)
If a datetime like "2024:07:23 11:06:32" is stored as text instead of a timestamp, queries like --where 'DateTimeOriginal|year=2024' won’t work—the modifier expects a timestamp, not a string.
Providing Type Hints
Wrap values in an object with value and type:
{"source_id":123,"basis_rev":0,"facts":{
"DateTimeOriginal": {"value": "2024:07:23 11:06:32", "type": "datetime"},
"duration": {"value": "1:23:45", "type": "duration"},
"rating": 5
}}
| Type | Parses | Stored As |
|---|---|---|
| datetime | ISO dates, EXIF format, plain years (2024) | Unix timestamp |
| duration | "1:23:45", "5:30", or seconds as number | Seconds (number) |
| (none) | Strings as text, numbers as numbers | As-is |
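To show what the conversions in the table amount to, here is a rough Python sketch of both parsers. The accepted formats are a guess based on the table, and interpreting timezone-less dates as UTC is an assumption; Canon's own parsing rules may differ.

```python
from datetime import datetime, timezone

# Formats suggested by the table: EXIF, ISO variants, plain year (assumed list)
DATETIME_FORMATS = ("%Y:%m:%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%Y")


def parse_datetime(value):
    """Convert a date string to a Unix timestamp, assuming UTC."""
    for fmt in DATETIME_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"unrecognized datetime: {value!r}")


def parse_duration(value):
    """Convert "H:MM:SS", "M:SS", or a plain number to seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


print(parse_datetime("2024:07:23 11:06:32"))
print(parse_duration("1:23:45"), parse_duration("5:30"), parse_duration(90))
```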
Common Pitfalls
Dates as strings: EXIF dates from tools like exiftool come as strings ("2024:07:23 11:06:32"). Without a type hint, they’re stored as text and time modifiers won’t work. Always use "type": "datetime" for date fields.
Mixed types: A fact key must have a consistent type across all sources. You cannot store DateTimeOriginal as text for some files and as a timestamp for others. If you initially imported facts with the wrong type and need to re-import with the correct type, first delete the existing entries:
# Delete all DateTimeOriginal facts that were stored as text
canon facts delete --key content.DateTimeOriginal --type text
Then re-run your processor with proper type hints.
Archive Sources
By default, importing facts for sources in archive roots is skipped. Use --allow archived to enable this (useful for backfilling metadata on already-archived files).
Writing Processors
Processors are scripts or programs that read worklist entries, extract metadata from files, and output facts for import.
Input and Output
A processor reads JSONL from worklist and writes JSONL for import-facts.
Input (from worklist):
{"source_id":123,"path":"/photos/IMG_001.jpg","basis_rev":0,"size":1024,"mtime":1703980800}
Output (for import-facts):
{"source_id":123,"basis_rev":0,"facts":{"Make":"Apple","Model":"iPhone 12"}}
The processor must pass through source_id and basis_rev unchanged.
Custom Processors
Read JSONL from stdin, extract facts from each file, output JSONL to stdout:
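#!/bin/bash
# Reads worklist entries (JSONL) on stdin, writes import-facts records on stdout.
while IFS= read -r line; do
  source_id=$(jq -r '.source_id' <<<"$line")
  basis_rev=$(jq -r '.basis_rev' <<<"$line")
  path=$(jq -r '.path' <<<"$line")
  # Extract facts (example: EXIF data)
  facts=$(exiftool -json -Make -Model "$path" 2>/dev/null | jq -c '.[0]')
  # Skip files exiftool could not read instead of emitting a broken record
  if [ -z "$facts" ] || [ "$facts" = "null" ]; then
    continue
  fi
  # Pass source_id and basis_rev through unchanged
  jq -nc \
    --argjson source_id "$source_id" \
    --argjson basis_rev "$basis_rev" \
    --argjson facts "$facts" \
    '{source_id: $source_id, basis_rev: $basis_rev, facts: $facts}'
done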
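A processor does not have to be a shell script. As a hedged illustration, the same input/output contract in Python might look like the sketch below. It guesses a MIME type from the file extension with the standard mimetypes module (so it never opens the file, unlike file --mime-type); the process function name is hypothetical.

```python
import json
import mimetypes


def process(lines):
    """Turn worklist JSONL lines into import-facts JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        mime, _encoding = mimetypes.guess_type(entry["path"])
        if mime is None:
            continue  # unknown extension: skip rather than emit an empty record
        yield json.dumps(
            {
                "source_id": entry["source_id"],   # passed through unchanged
                "basis_rev": entry["basis_rev"],   # passed through unchanged
                "facts": {"mime": mime},
            },
            separators=(",", ":"),                 # compact, one object per line
        )


sample = [
    '{"source_id":123,"path":"/photos/IMG_001.jpg","basis_rev":0,"size":1024,"mtime":1703980800}',
]
outputs = list(process(sample))
for out in outputs:
    print(out)
```

To use it between canon worklist and canon import-facts, iterate over sys.stdin instead of the sample list.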
The canonargs Helper
If you don’t want to handle JSONL parsing and output formatting yourself, canonargs takes care of that. You only provide a command that extracts data from a single file.
Installation
cargo install canonargs
Single Fact Mode
When your command outputs a single value:
canon worklist | canonargs --fact mime -- file -b --mime-type {} | canon import-facts
The {} is replaced with the file path. The command’s stdout becomes the fact value.
Default behavior: Values are stored as text. To specify a type, add --type:
# Store as datetime (enables |year, |month modifiers)
canon worklist | canonargs --fact DateTimeOriginal --type datetime -- exiftool -DateTimeOriginal -s3 {} | canon import-facts
# Store image width as number (using ImageMagick's identify)
canon worklist | canonargs --fact width --type number -- identify -format '%w' {} | canon import-facts
Valid types: datetime, duration, number
Key-Value Mode
When your command outputs key=value pairs (one per line):
canon worklist | canonargs --kv -- my-extractor {} | canon import-facts
Default behavior: All values are stored as text. To specify types, use key:type=value syntax:
width:number=1920
height:number=1080
DateTimeOriginal:datetime=2024:07:23 14:30:00
codec=h264
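The key:type=value syntax is simple enough to parse by hand. The Python sketch below shows one plausible mapping from such lines to fact values. How canonargs actually encodes each type is not documented here, so emitting number as a JSON number and the other types in the typed hint format is an assumption; parse_kv_line is a hypothetical name.

```python
def parse_kv_line(line):
    """Split 'key[:type]=value' into (key, fact value)."""
    left, _, value = line.partition("=")        # split on the first '=' only
    key, _, type_hint = left.partition(":")     # optional ':type' suffix on the key
    if not type_hint:
        return key, value                       # default: stored as text
    if type_hint == "number":
        number = float(value)
        return key, int(number) if number.is_integer() else number
    return key, {"value": value, "type": type_hint}


lines = [
    "width:number=1920",
    "height:number=1080",
    "DateTimeOriginal:datetime=2024:07:23 14:30:00",
    "codec=h264",
]
facts = dict(parse_kv_line(line) for line in lines)
print(facts)
```

Splitting on the first = before looking for the colon keeps the colons inside the EXIF date value intact.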
JSON Mode
When your command outputs a JSON object:
canon worklist | canonargs --json -- exiftool -json {} | canon import-facts
Example extractor output:
{"Make": "Apple", "Model": "iPhone 12", "DateTimeOriginal": "2024:07:23 14:30:00"}
JSON mode auto-detects numbers. If your command outputs "width": 1920 (a JSON number), it’s stored as a number. If it outputs "width": "1920" (a quoted string), it’s stored as text.
For datetime fields, you still need to use the typed hint format:
{"DateTimeOriginal": {"value": "2024:07:23 14:30:00", "type": "datetime"}}
Chaining
Processors can be chained since canonargs passes through the worklist entry and merges facts:
canon worklist \
| canonargs --fact mime -- file -b --mime-type {} \
| canonargs --json -- exiftool -json {} \
| canon import-facts
Using Existing Facts
Processors can access previously imported facts via the --emit flag on worklist. See Emitting Existing Facts for details.
Type Hints
Important: The type of a fact determines what operations work on it:
- Timestamps enable |year, |month modifiers and date comparisons (>=2024-01-01)
- Numbers enable numeric comparisons (>1000) and the |bucket modifier
- Text enables string matching and |lowercase, |stem modifiers
If your processor outputs dates as strings or numbers as strings, add type hints:
{"source_id":123,"basis_rev":0,"facts":{
"DateTimeOriginal": {"value": "2024:07:23 11:06:32", "type": "datetime"},
"duration": {"value": "1:23:45", "type": "duration"},
"width": 1920
}}
Without "type": "datetime", a date string like "2024:07:23 11:06:32" is stored as text and --where 'DateTimeOriginal|year=2024' won’t work.
Numbers from JSON are automatically stored as numbers. But if your extractor outputs "width": "1920" (a string), numeric comparisons like --where 'width>1000' won’t work as expected.
See import-facts for full details.
Tips
- Always pass through source_id and basis_rev unchanged
- Use jq -c for compact JSON output (one object per line)
- Handle errors gracefully: skip files that can’t be processed
- Use type hints for datetime fields so modifiers work correctly
- Ensure numbers are actual JSON numbers, not quoted strings
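The tips above can be combined in a small helper for a custom processor. This is a minimal sketch, assuming each worklist entry is a JSON object that carries at least source_id and basis_rev. The function name `build_fact_line` and the `datetime_keys` / `number_keys` parameters are hypothetical.

```python
import json


def build_fact_line(entry, facts, datetime_keys=(), number_keys=()):
    """Build one compact JSON line in the shape shown under Type Hints.

    entry: a worklist entry (assumed to contain source_id and basis_rev).
    facts: raw extractor output, possibly with numbers as strings.
    """
    out_facts = {}
    for key, value in facts.items():
        if key in datetime_keys:
            # Typed hint so |year and |month modifiers work.
            out_facts[key] = {"value": value, "type": "datetime"}
        elif key in number_keys:
            # Emit a real JSON number rather than a quoted string.
            out_facts[key] = float(value) if "." in str(value) else int(value)
        else:
            out_facts[key] = value
    record = {
        "source_id": entry["source_id"],  # passed through unchanged
        "basis_rev": entry["basis_rev"],  # passed through unchanged
        "facts": out_facts,
    }
    # Compact separators keep each record on a single line.
    return json.dumps(record, separators=(",", ":"))


print(build_fact_line(
    {"source_id": 123, "basis_rev": 0},
    {"DateTimeOriginal": "2024:07:23 11:06:32", "width": "1920"},
    datetime_keys={"DateTimeOriginal"},
    number_keys={"width"},
))
```

A caller that wants to skip unprocessable files can wrap the extraction step in a try/except and print nothing for that entry.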
Querying
After scanning and enriching, you can explore your indexed files.
- ls - List sources matching filter expressions
- facts - Discover available facts and check coverage
- compare - Compare directories to find overlap
- survey - Survey a selection for archive status, related locations, and unique content
All query commands support path scoping (limit to a subdirectory) and --where filters.
Scope defaulting: When no paths are given, query commands scope to the current directory if it’s inside a known root. If the current directory is not inside any root, commands operate globally across all roots. Use --global to force global scope while inside a root.
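The defaulting rule can be read as a small decision function. The sketch below is an illustration only, not Canon's implementation: `resolve_scope` is a hypothetical name, and giving explicit paths precedence over --global is an assumption.

```python
from pathlib import PurePosixPath


def resolve_scope(cwd, roots, paths=(), global_flag=False):
    """Return a list of scope paths, or None for global scope."""
    if paths:
        return list(paths)  # explicit paths win (assumed precedence)
    if global_flag:
        return None  # --global forces global scope
    cwd_path = PurePosixPath(cwd)
    for root in roots:
        root_path = PurePosixPath(root)
        # Compare path components, not string prefixes, so that
        # /Volumes/old-drive-2 is not treated as inside /Volumes/old-drive.
        if cwd_path == root_path or root_path in cwd_path.parents:
            return [cwd]
    return None  # outside every known root: operate globally


print(resolve_scope("/Volumes/old-drive/photos", ["/Volumes/old-drive"]))
print(resolve_scope("/home/me", ["/Volumes/old-drive"]))
```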
canon ls
List sources matching filters. Useful for quick inspection and piping to other tools.
# List sources in current directory (default when inside a root)
canon ls
# List sources matching a filter
canon ls --where 'source.ext=jpg'
# Filter by source ID
canon ls --where 'source.id=12345'
# List only archived sources (content exists in an archive)
canon ls --archived
# List archived sources with their archive location(s)
# Output: source_path<TAB>archive_path (one line per archive location)
canon ls --archived=show
# List only unarchived sources (hashed but not in any archive)
canon ls --unarchived
# List only unhashed sources (no content hash yet)
canon ls --unhashed
# Show duplicate files (same content hash), grouped by hash
canon ls --duplicates
# Show only excluded sources (source-level and object-level)
canon ls --excluded
# Include sources from archive roots (automatic when scope is in an archive)
canon ls --include archived
# Include excluded sources in results
canon ls --include excluded
# Include both archived and excluded sources
canon ls --include all
# Query all roots, ignoring current directory scope
canon ls --global --where 'source.ext=jpg'
# Long format with size and date
canon ls -l
# Null-delimited output for xargs (handles spaces in paths, macOS)
canon ls -0 --where 'source.ext=jpg' | xargs -0 open -a Preview
Filter modes (--archived, --unarchived, --unhashed, --duplicates, --excluded) are mutually exclusive – only one can be active at a time.
Status column in long format: When --include is used or --excluded mode is active, ls -l shows a status column indicating source state: E (source-level exclusion), X (object-level exclusion), A (archived), or blank.
Scope display: When scoped (via CWD or explicit path), ls prints scope: /path to stderr so you always know what you’re looking at. When global, no scope line is printed.
Path display:
- CWD-scoped (no explicit path, inside a root) → relative output paths
- Explicit absolute path or --global → absolute output paths
Output is one path per line (stdout), with a count printed to stderr:
scope: /Volumes/old-drive/photos
vacation/img001.jpg
vacation/img002.jpg
work/doc.pdf
3 sources
canon facts
Discover what metadata you have and check coverage.
# Overview of all facts (scoped to current directory when inside a root)
canon facts
# Scoped to a specific directory
canon facts /path/to/photos
# Global overview across all roots
canon facts --global
# With filters
canon facts --where 'source.ext=jpg'
# Value distribution for a specific fact
canon facts --key content.Make
# With modifiers: group mtime by year-month
canon facts --key 'source.mtime|yearmonth'
# With accessors: distribution by top-level directory
canon facts --key 'source.rel_path[0]'
# Combine accessor and modifier: distribution by filename extension
canon facts --key 'source.rel_path[-1]|ext'
# Show hidden built-in facts
canon facts --all
# Unlimited results (default is 50)
canon facts --key content.hash.sha256 --limit 0
# Include sources from archive roots
canon facts --include archived
# Include excluded sources
canon facts --include excluded
# Include both
canon facts --include all
# Show source count per root (which roots have matching content?)
canon facts --by-root
canon facts --where '@image' --by-root
# Group fact values by root (which roots contribute to each value?)
canon facts --key source.ext --by-root
# Group by any fact key (with modifiers)
canon facts --key source.ext --group-by 'source.mtime|year'
# Compound grouping (root + another fact)
canon facts --key source.ext --by-root --group-by 'content.Make'
The output begins with a scope header showing what’s being queried (Facts: /path or Facts: all roots).
Example output:
Facts: all roots
Sources matching filters: 34692
Fact Count Coverage
────────────────────────────────────────────────────
source.ext 34692 100.0% (built-in)
source.size 34692 100.0% (built-in)
source.mtime 34692 100.0% (built-in)
source.path 34692 100.0% (built-in)
content.hash.sha256 34692 100.0%
content.mime 34692 100.0%
content.Model 7935 22.9%
content.Make 7935 22.9%
...
Example grouped output (--by-root):
source.ext (by root)
jpg (total: 12,500, 36.0%)
id:1 ...stack/Backup/Pictures 8,000 64.0%
id:2 ...castor-import/gringo 4,500 36.0%
png (total: 8,200, 23.6%)
id:1 ...stack/Backup/Pictures 5,000 61.0%
id:3 ...castor-import/hydra 3,200 39.0%
See also: facts delete for removing incorrect metadata, prune for cleaning up stale or orphaned data.
canon compare
Compare two folders by content hash. Useful for verifying backups or finding differences between directories.
# Compare current directory against another location
canon compare /path/to/folder_b
# Compare two explicit directories
canon compare /path/to/folder_a /path/to/folder_b
# With filters
canon compare /path/to/folder_a /path/to/folder_b --where 'source.ext=jpg'
# Include excluded sources in comparison
canon compare /path/to/folder_a /path/to/folder_b --include excluded
# Show file paths for differences
canon compare /path/to/folder_a /path/to/folder_b --verbose
With one path argument, the current directory is used as side A and the argument as side B. With two paths, they are used as A and B explicitly. The current directory must be inside a known root when used as side A.
Output shows:
- Files only in A (by content)
- Files only in B (by content)
- Files in both (matching content hash)
Exit code is 0 if identical, 1 if differences found.
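Conceptually, the comparison works on content hashes rather than file names. The following sketch illustrates that idea with plain dictionaries. The function name and data layout are hypothetical, and Canon's actual implementation may differ.

```python
def compare_by_content(side_a: dict[str, str], side_b: dict[str, str]):
    """Compare two {path: content_hash} maps by hash, ignoring names.

    Returns (only_in_a, only_in_b, in_both, exit_code), where the path
    lists for only_in_a and in_both refer to side A.
    """
    hashes_a = set(side_a.values())
    hashes_b = set(side_b.values())
    only_a = sorted(p for p, h in side_a.items() if h not in hashes_b)
    only_b = sorted(p for p, h in side_b.items() if h not in hashes_a)
    both = sorted(p for p, h in side_a.items() if h in hashes_b)
    # 0 when both sides hold the same content, 1 otherwise.
    exit_code = 0 if not only_a and not only_b else 1
    return only_a, only_b, both, exit_code


a = {"a/x.jpg": "h1", "a/y.jpg": "h2"}
b = {"b/renamed.jpg": "h1", "b/z.jpg": "h3"}
print(compare_by_content(a, b))
```

Because matching is by hash, a renamed file counts as present on both sides.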
canon survey
Survey a location to understand what’s here, where it connects, and what’s unique. The default output is an orientation map — what’s archived, which other locations share content, and how much exists only here. Use it as a starting point when arriving at a new folder, an old drive, or any scope you want to understand.
# Survey current directory
canon survey
# Survey a specific path
canon survey /mnt/old-drive/photos
# Survey with filters
canon survey /mnt/old-drive/photos --where "@image AND source.mtime|year=2016"
# See which of your files overlap with related locations
canon survey /mnt/old-drive/photos --detail overlap
# See content that exists nowhere else
canon survey /mnt/old-drive/photos --detail unique
# Pipe unique paths for further processing
canon survey /mnt/old-drive/photos --detail unique -0 | xargs -0 open
# See what's NOT at a reference location
canon survey /mnt/old-drive/photos --detail residual --other /mnt/backup/vacation/
# Add affinity columns to understand related locations deeper (requires --where)
canon survey /mnt/old-drive/photos --where "@image" --affinity
# See complementary content at related locations (requires --where)
canon survey /mnt/old-drive/photos --where "@image" --detail complement
# Compare against specific locations instead of discovering them
canon survey /mnt/old-drive/photos --other /mnt/backup/vacation/
# Filter archive section to a specific archive
canon survey /mnt/old-drive/photos --archive path:/archive/photos
# Include excluded sources in the selection
canon survey /mnt/old-drive/photos --include excluded
# Survey all roots globally (when inside a root but want the full picture)
canon survey --global
Options
| Flag | Description |
|---|---|
| --where <EXPR> | Filter expression (repeatable). Narrows the selection. |
| --affinity | Enable affinity columns (+N more, unique count, classification). Requires --where. |
| --detail <MODE> | complement, unique, overlap, or residual. Replaces the summary view. |
| --archive <SPEC> | Filter archive section to a specific archive root (id:N or path:/...). |
| --include <VALUE> | Include additional sources: excluded. |
| --global | Survey all roots, ignoring current directory scope. |
| --other <PATH> | Compare against specific locations (repeatable). Bypasses scope discovery. |
| --brief | Skip per-location affinity computation when --affinity is active. |
| --verbose | Show all locations (summary) or all paths per location (detail views). |
| -0 | Null-delimited output for --detail unique, --detail overlap, or --detail residual. |
Reading the output
Summary view (default)
The default output is an orientation view — designed to help you understand the character of a place before deciding what to do next.
Survey: /mnt/old-drive/exports
517 sources here (0 unhashed, 517 hashed)
264 unique here
Archived: 201 of 517 (38.9%)
/archive/media/2019/holiday 41
/archive/media/2019/kids 35
/archive/media/2019/home 43
/archive/media/2020/kids 22
...
Related locations:
/mnt/backup/pictures/phone/ 161 of 517 overlap (18,057 total)
/mnt/sandisk-export/camera-roll/2019/dec 82 of 517 overlap (370 total)
/mnt/sandisk-export/camera-roll/2020/jan 40 of 517 overlap (211 total)
/mnt/backup/phone/2019-W48 37 of 517 overlap (115 total)
... and 6 more locations (use --verbose to show all)
The output has three sections:
Survey header: Shows your scope, any active filters, and source counts. The unhashed/hashed split tells you how many files can participate in content comparison — unhashed files can’t be matched. “Unique here” is the count of content that exists nowhere else in Canon’s universe.
Archived: How many of your files have copies in an archive. The archive paths show where in the archive this content lives — the path names often reveal what past-you was thinking when you archived it. Use --detail overlap --other <archive-path> to see which specific files are archived at a given location.
Related locations: Other places in Canon’s universe that share content with your selection. Each line shows:
- N of M overlap: How many of your files also exist at this location
- (T total): How many files are at this location overall — this tells you the location’s scale relative to the overlap
Use --detail overlap to see which of your files appear at each location. Locations are sorted by overlap count, highest first.
Adding filters
The summary works without any --where filters — it shows the full character of a location. Filters narrow what you’re looking at:
# What's the story for just the images here?
canon survey /mnt/old-drive/exports --where "@image"
# What about content from a specific period?
canon survey /mnt/old-drive/exports --where "source.mtime|year=2019"
The same related locations may appear with different overlap counts, because the overlap is computed against your filtered selection.
Affinity mode (--affinity)
When you have a --where filter and want to understand what related locations have beyond the overlap, --affinity adds classification columns:
Related locations:
/mnt/backup-2022/photos/italy/ ≥ 380 of 388 overlap (420 total) +95 more (31 unique)
/mnt/partner-laptop/DCIM/vacation > 45 of 388 overlap (225 total) +180 more (42 unique)
/mnt/backup-2022/photos/misc/ ⊆ 30 of 388 overlap (30 total)
The additional columns:
- +N more: Files at this location that match your filters but have different content from your selection — what you’d find if you went there
- (K unique): Of those, how many exist nowhere else
- Classification symbol: How this location relates to your selection (see below)
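In set terms, these columns are simple differences over content hashes. The sketch below is a simplification under stated assumptions: it counts distinct hashes, whereas Canon reports file counts, so numbers can differ when duplicates exist. The name `affinity_counts` and its inputs are hypothetical.

```python
def affinity_counts(selection: set, location_matching: set, elsewhere: set):
    """Return (overlap, more, unique) for one related location.

    selection:          hashes in your filtered selection
    location_matching:  hashes at the location that match the same filters
    elsewhere:          hashes found anywhere other than that location
    """
    overlap = selection & location_matching  # content you both have
    more = location_matching - selection     # "+N more"
    unique = more - elsewhere                # "(K unique)": exists only there
    return len(overlap), len(more), len(unique)


print(affinity_counts({"a", "b", "c"}, {"b", "c", "d", "e"}, {"d"}))
```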
The four dispositions
With --affinity, each related location is classified:
- Superset (≥) — Has nearly everything you have, plus more matching content. A more complete version of what you’re looking at.
- Lead (>) — Has complementary content with partial overlap. A related collection with additional material.
- Subset (⊆) — High overlap, no complementary content, and most of the location’s own content overlaps with yours. A smaller copy.
- Mirror (=) — Overlap but no complementary content, and the location has significant other content outside your filter. A partial copy within a larger collection.
Locations are sorted by classification: supersets first, then leads, then subsets, then mirrors. Within each group, sorted by complementary count descending, then overlap count descending.
Detail views
Detail views replace the summary with specific file listings. They answer the “show me” questions that arise from reading the summary.
| Summary signal | Question | Detail view |
|---|---|---|
| “201 of 517 archived (38.9%)” | Which files are archived, and where? | --detail archived |
| “264 unique here” | What content exists only here? | --detail unique |
| “161 of 517 overlap” | Which of my files are at that location? | --detail overlap |
| “+95 more” (affinity) | What matching content is over there? | --detail complement |
| — | What’s here that’s NOT at a specific location? | --detail residual |
Archived (--detail archived)
Shows which of your files are archived, grouped by archive location, with counterpart paths showing where each file lives in the archive:
Archived files (201 sources across 6 locations):
Archived at /archive/media/2019/home (43 files):
exports/photos/IMG_0001.jpg
→ media/2019/home/IMG_0001.jpg
exports/photos/IMG_0002.jpg
→ media/2019/home/IMG_0002.jpg
... and 38 more
Archived at /archive/media/2019/holiday (41 files):
exports/vacation/DSC_0100.jpg
→ media/2019/holiday/DSC_0100.jpg
...
Locations are sorted by file count (most files first). When results are small (20 or fewer per location), all paths are shown; otherwise capped at 5. Use --verbose to see all. With -0, output is flat, deduplicated selection-side paths only (for piping to xargs -0).
Use --archive to filter to a specific archive root.
Unique (--detail unique)
Outputs paths of files whose content exists nowhere else:
photos/2016-07-14/IMG_4201.jpg
photos/2016-07-14/IMG_4202.jpg
photos/2016-07-18/DSC_0891.jpg
Paths are relative when the scope is under the current directory, absolute otherwise. Use -0 for null-delimited absolute paths (for xargs -0).
Overlap (--detail overlap)
Shows which of your files have copies at each related location, along with the counterpart paths at that location:
Overlapping with related locations (overlap):
/mnt/backup/phone-export/ (4 of 135 overlap):
recordings/morning-walk.m4a
→ audio/2020/morning-walk.m4a
recordings/evening-notes.m4a
→ audio/misc/recording-001.mp3
photos/IMG_0042.JPG
→ DCIM/2020-W48/IMG_0042.JPG
→ DCIM/2020-W48/IMG_0042 2.JPG
Each → line shows where the matching content lives at the other location. Multiple counterparts appear when the same content exists more than once (e.g., OS-generated duplicates like IMG_0042 2.JPG). Counterpart paths are relative to the location.
When results are small (20 or fewer), all paths are shown. For larger results, paths are capped at 5 per location; use --verbose to see all. With -0, output is flat and deduplicated selection-side paths only (no counterpart data) — for piping to xargs -0.
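The truncation rule is easy to express in code. This is a minimal sketch of the behavior described above, with a hypothetical function name. The "... and N more" line mirrors the example output shown in this section.

```python
def render_paths(paths, verbose=False, small=20, cap=5):
    """Return the lines to print for one location's path list."""
    paths = list(paths)
    if verbose or len(paths) <= small:
        return paths  # small results, or --verbose: show everything
    hidden = len(paths) - cap
    return paths[:cap] + [f"... and {hidden} more"]


lines = render_paths([f"IMG_{i:04d}.jpg" for i in range(43)])
print("\n".join(lines))
```

With 43 paths, five are printed followed by a line noting that 38 more are hidden.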
Complement (--detail complement)
Requires --where. Shows files at related locations that match your filters but have different content from your selection. Implies affinity computation.
Complementary content at related locations:
/mnt/backup-2022/photos/italy/ (+95, 31 unique):
week3/IMG_4501.jpg
week3/IMG_4502.jpg
week3/IMG_4503.jpg
week4/IMG_4601.jpg
week4/IMG_4602.jpg
... and 90 more
Paths are relative to the location. When results are small (20 or fewer), all paths are shown; otherwise capped at 5 per location. Use --verbose to see all.
Residual (--detail residual)
Requires --other. Shows which of your files are NOT shared with the reference location:
Not at /mnt/backup/vacation/ (residual):
photos/IMG_4201.jpg
photos/IMG_4202.jpg
photos/IMG_4203.raw
Unhashed files are always included in residual output — without a hash, their presence at the reference location can’t be confirmed. Use -0 for flat output. With multiple --other locations, each gets a separate listing.
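The treatment of unhashed files can be seen in a small sketch. It assumes the selection is a mapping from path to content hash, with None for unhashed files. The function name `residual` is illustrative.

```python
from typing import Optional


def residual(selection: dict[str, Optional[str]], reference_hashes: set[str]) -> list[str]:
    """Paths in the selection that are not confirmed at the reference location."""
    return [
        path
        for path, content_hash in selection.items()
        # An unhashed file cannot be matched, so it always stays in the residual.
        if content_hash is None or content_hash not in reference_hashes
    ]


print(residual(
    {"photos/IMG_4201.jpg": "h1", "photos/IMG_4202.jpg": None, "photos/IMG_4203.raw": "h3"},
    {"h1"},
))
```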
Directed comparison (--other)
By default, survey discovers related locations by searching Canon’s full universe for content overlap. --other lets you specify locations directly:
canon survey /mnt/old-drive/photos \
--other /mnt/backup/vacation_italy/ \
--other /mnt/partner-laptop/DCIM/
Differences from default mode:
- Header reads “Comparing with:” instead of “Related locations:”
- Locations are displayed in user-specified order (not sorted)
- In --detail complement, mirrors are shown with a note rather than omitted
Archive status and unique counts are always computed against the full universe regardless of --other.
How exploration typically flows
Survey supports a non-linear exploration style. You might follow any of these paths depending on what the summary reveals:
Orientation: Arrive at a location, survey it, read the landscape. The archive paths and location names often tell you what a place is — a phone backup, a project folder, a parking dump. From here you might scope down to a subfolder, add --where filters, or drill into a detail view.
Following a thread: A related location catches your eye. Survey it directly (canon survey <that-path>) to understand it. Check --detail overlap to see which files connect the two places. Use canon facts to understand what metadata is available, then refine with --where.
Assessing coverage: Use --affinity with a --where filter to see which locations have more matching content. Drill into --detail complement to see what’s there. Use --detail residual --other <location> to see what’s not covered.
Acting on results: Pipe --detail unique -0 or --detail overlap -0 to downstream tools — xargs -0 open for inspection, xargs -0 ls -la for sizes, or further processing when you’re ready.
Managing Sources
After scanning and enriching, you may want to control which sources are included in archiving operations.
The exclude command lets you mark sources to skip during cluster generate and apply. This is useful for:
- Ignoring temporary or system files
- Skipping known duplicates while keeping a preferred copy
- Filtering out small files below a size threshold
- Removing unwanted files from consideration without deleting them
Exclusions are stored directly on sources and can be cleared at any time.
canon exclude
Manage source exclusions. Excluded sources are skipped by most commands.
# Mark sources as excluded (e.g., small files, temp files)
canon exclude set --where 'source.size<1000'
canon exclude set /path/to/photos --where 'source.ext=tmp'
# Exclude a specific file by path
canon exclude set /path/to/photos/unwanted.jpg
# Exclude by source ID (shown in ls --duplicates output)
canon exclude set --id 12345
# Preview what would be excluded
canon exclude set --where 'source.ext=bak' --dry-run
# Skip confirmation prompt (for scripting)
canon exclude set --where 'source.ext=bak' --yes
# View excluded sources (use ls --excluded instead of the removed exclude list)
canon ls --excluded
canon ls --excluded /path/to/photos
# Remove exclusions
canon exclude clear
canon exclude clear --where 'source.ext=tmp'
# Preview what would be cleared
canon exclude clear --where 'source.ext=tmp' --dry-run
# Skip confirmation prompt
canon exclude clear --yes
When excluding or clearing more than one source, a confirmation prompt shows the count, root spread, and (for exclude set) archive coverage before proceeding. Use --yes to skip the prompt, or --dry-run to preview without executing.
canon exclude duplicates
Automatically exclude duplicate files while keeping copies in a preferred location.
# Exclude duplicates, keeping files under /preferred/path
canon exclude duplicates /scope/path --prefer /preferred/path
# Preview what would be excluded
canon exclude duplicates /scope/path --prefer /preferred/path --dry-run
# Skip confirmation prompt
canon exclude duplicates /scope/path --prefer /preferred/path --yes
# With filters
canon exclude duplicates /scope/path --prefer /preferred/path --where 'source.ext=jpg'
This is useful for deduplicating across backup drives while keeping the “canonical” copy in your preferred location.
When excluding more than one source, a confirmation prompt shows the count, number of duplicate groups, and skip statistics before proceeding. Use --yes to skip the prompt.
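As a rough model of what --prefer does, the sketch below groups sources by hash and excludes the copies outside the preferred location. Two points are assumptions rather than documented behavior: groups with no copy under the preferred path are skipped untouched, and every copy under the preferred path is kept. All names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_duplicate_exclusions(sources, prefer):
    """sources: iterable of (path, content_hash). Returns (to_exclude, skipped_groups)."""
    groups = defaultdict(list)
    for path, content_hash in sources:
        groups[content_hash].append(path)

    prefer_path = PurePosixPath(prefer)
    to_exclude, skipped = [], 0
    for paths in groups.values():
        if len(paths) < 2:
            continue  # not a duplicate group
        preferred = [p for p in paths if prefer_path in PurePosixPath(p).parents]
        if not preferred:
            skipped += 1  # no preferred copy: leave the group alone (assumption)
            continue
        to_exclude.extend(p for p in paths if p not in preferred)
    return sorted(to_exclude), skipped


print(plan_duplicate_exclusions(
    [("/keep/a.jpg", "h1"), ("/old/a.jpg", "h1"), ("/old/b.jpg", "h2"), ("/older/b.jpg", "h2")],
    "/keep",
))
```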
How exclusions affect other commands:
| Command | Default behavior | Override |
|---|---|---|
| ls | Skips excluded | --include excluded or --excluded filter mode |
| worklist | Skips excluded | --include excluded |
| facts | Skips excluded, shows count | --include excluded |
| coverage | Stats on included only | --include excluded shows excluded dimension |
| cluster generate | Always skips excluded | No override (hard gate) |
| apply | Blocks if manifest has excluded | No override (hard gate) |
Exclusions are stored directly on sources and objects in the database.
Archiving
When you find a collection of files to archive, Canon uses a two-step process:
- Generate a manifest with cluster - select files and define the destination
- Apply the manifest with apply - copy or move files to the archive
This workflow lets you review and customize the output before committing to any file operations.
- coverage - Check how much has been archived
- cluster - Generate a manifest for a set of files
- apply - Execute the manifest to copy/move files
canon coverage
Show archive coverage statistics - how many sources are hashed and how many are archived.
# Coverage for current directory (when inside a root)
canon coverage
# Scoped to a specific directory
canon coverage /path/to/photos
# Global overview of all source roots
canon coverage --global
# With filters
canon coverage --where 'source.ext=jpg'
# Coverage relative to a specific archive root
canon coverage --archive id:1
canon coverage --archive path:/path/to/archive
# Include archive roots in analysis
canon coverage --include archived
# Include excluded sources
canon coverage --include excluded
# Include both
canon coverage --include all
The output begins with a scope header (Coverage: /path or Coverage: all roots).
Example output (global):
Coverage: all roots
Root: /path/to/backup1 (source)
Total sources: 1,234
Hashed: 1,100 (89.1%)
Archived: 850 (77.3% of hashed)
Unarchived: 250
Root: /path/to/backup2 (source)
Total sources: 567
Hashed: 500 (88.2%)
Archived: 400 (80.0% of hashed)
Unarchived: 100
────────────────────────────────────────
Overall:
Total sources: 1,801
Hashed: 1,600 (88.8%)
Archived: 1,250 (78.1% of hashed)
Unarchived: 350
- Hashed: Sources with a content hash (ready for archiving)
- Archived: Sources whose content exists in an archive root
- With --archive: Shows “In this archive” vs “Not in archive” for that specific archive
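The percentages in the example output use two different denominators: hashed is a share of total sources, while archived is a share of hashed sources. A short sketch (hypothetical function name) reproduces the figures for the first root:

```python
def coverage_stats(total: int, hashed: int, archived: int) -> dict:
    """Compute the derived figures shown in a coverage block."""
    return {
        # Hashed is reported relative to all sources.
        "hashed_pct": round(100 * hashed / total, 1) if total else 0.0,
        # Archived is reported relative to hashed sources only.
        "archived_pct": round(100 * archived / hashed, 1) if hashed else 0.0,
        # Unarchived counts hashed sources with no archived copy.
        "unarchived": hashed - archived,
    }


print(coverage_stats(total=1234, hashed=1100, archived=850))
```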
canon cluster generate
Generate a manifest of files matching filters. The --dest flag specifies where files will be copied and must be inside a registered archive root.
# All photos to an archive (unhashed sources are automatically skipped)
canon cluster generate --where 'source.ext IN (jpg, png, heic)' --dest /Volumes/Archive/Photos
# Destination can be a subdirectory within an archive
canon cluster generate --where 'source.ext IN (jpg, png, heic)' --dest /Volumes/Archive/Photos/2024
# Scope to a specific path
canon cluster generate /path/to/photos --dest /Volumes/Archive
# Custom output file
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive -o my-manifest.toml
# Allow sources from archive roots
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive --allow archived
# Allow duplicate content (same hash already in an archive)
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive --allow duplicates
# Show which files were excluded (already archived)
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive --show-archived
# Overwrite existing manifest file
canon cluster generate --where 'source.ext=jpg' --dest /Volumes/Archive --force
The command generates two files: a manifest (.toml) that you edit, and a lock file (.lock) containing the source list.
Typical workflow:
canon cluster generate --where 'source.ext IN (jpg, png, heic)' --dest /Volumes/Archive
# Edit manifest.toml to customize the output pattern
canon apply manifest.toml --dry-run # Preview
canon apply manifest.toml # Execute
Output:
After generating, the command prints a summary showing root breakdown and archive coverage:
Generated manifest: manifest.toml (1,234 sources in manifest.lock)
From 2 roots:
/Volumes/Drive1 (800)
/Volumes/Drive2 (434)
1,234 have no archived copy
Manifest structure:
The generated manifest includes a cluster summary, a notes section for your own annotations, and helpful comments listing available pattern variables:
# === Cluster Summary ===
# 1,234 sources from 2 roots:
# /Volumes/Drive1 (800)
# /Volumes/Drive2 (434)
# 1,234 have no archived copy
# === Notes ===
#
[meta]
version = 1
query = ["source.ext IN ('jpg', 'png', 'heic')"]
scope = "/path/to/photos"
generated_at = "2026-02-28T12:00:00Z"
lock_hash = "abc123..."
[options]
allow = [] # e.g. ["archived", "duplicates"]
[output]
pattern = "{filename}" # ← Edit this to customize organization
base_dir = "/Volumes/Archive"
archive_root_id = 2
# Available facts for pattern (100% coverage on 1234 sources):
# ...
- Cluster Summary is regenerated on each cluster refresh, showing current source counts, root breakdown, and archive coverage.
- Notes section is preserved across refreshes; add your own comments here.
- version field tracks the manifest format version.
- [options] records which --allow flags were used during generation. These are carried forward to apply and cluster refresh.
Common output patterns:
# Flat (default) - all files in base_dir
pattern = "{filename}"
# Preserve original folder structure (relocate as-is)
pattern = "{source.rel_path}"
# By EXIF date
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}"
# By EXIF date with hash prefix (avoids collisions)
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{hash_short}_{filename}"
# By camera model
pattern = "{content.Make}/{content.Model}/{filename}"
# By file type
pattern = "{source.ext}/{filename}"
See Pattern Expressions for the full syntax reference, including modifiers, path accessors, and aliases.
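To make the substitution concrete, here is a toy expander for the placeholder syntax used above. It is a sketch only: it supports just the |year and |month modifiers, the zero-padded month format is an assumption, and the function and table names are hypothetical. Canon's real pattern engine supports more (accessors, aliases, other modifiers).

```python
import re
from datetime import datetime

# Matches {key} or {key|modifier}.
PLACEHOLDER = re.compile(r"\{([^{}|]+)(?:\|(\w+))?\}")

MODIFIERS = {
    "year": lambda dt: f"{dt.year:04d}",
    "month": lambda dt: f"{dt.month:02d}",  # zero-padding is an assumption
}


def expand_pattern(pattern: str, facts: dict) -> str:
    """Replace each placeholder with the (optionally modified) fact value."""
    def substitute(match):
        key, modifier = match.group(1), match.group(2)
        value = facts[key]  # raises KeyError when a fact is missing
        if modifier is not None:
            value = MODIFIERS[modifier](value)
        return str(value)

    return PLACEHOLDER.sub(substitute, pattern)


facts = {
    "filename": "IMG_0001.jpg",
    "content.DateTimeOriginal": datetime(2024, 7, 23, 14, 30),
}
print(expand_pattern(
    "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}",
    facts,
))
```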
Refreshing the Lock File
Use canon cluster refresh to update the lock file if sources have changed since the manifest was generated:
# Re-query and update the lock file
canon cluster refresh manifest.toml
This re-runs the manifest’s query and updates manifest.lock with the current matching sources. The manifest settings ([options], [output]) remain unchanged.
On refresh:
- The Cluster Summary is regenerated with current counts
- The Notes section is preserved verbatim
- The same root breakdown and archive coverage summary is printed to stdout
canon apply
Apply a manifest to copy/move files. Copied files are automatically registered in the database with the same content hash, so they’re immediately recognized as archived (no separate scan needed).
# Preview what would happen (fast - skips source existence checks)
canon apply manifest.toml --dry-run
# Copy files (default mode, preserves mtime/permissions on Unix)
canon apply manifest.toml
# Show per-file progress during transfer
canon apply manifest.toml --verbose
# Resume a previously interrupted apply
canon apply manifest.toml --resume
# Rename files instead of copying (Unix only, fails on cross-device)
canon apply manifest.toml --rename
# Move files: rename if same device, copy+delete if cross-device
canon apply manifest.toml --move
# Only apply sources from specific roots
canon apply manifest.toml --root id:1 --root id:2
canon apply manifest.toml --root path:/path/to/source
# Allow duplicates within the destination archive
canon apply manifest.toml --allow duplicates
# Allow duplicates across archives (but not within destination)
canon apply manifest.toml --allow cross-archive-duplicates
Transfer modes:
| Flag | Behavior |
|---|---|
| (default) | Copy + preserve mtime/permissions (Unix) |
| --rename | Atomic rename; fails if cross-device (Unix only) |
| --move | Try rename; fallback to copy+delete on cross-device (Unix only) |
All modes use noclobber semantics: if a destination file exists, apply aborts with an error.
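For readers curious how noclobber transfers can be built, the sketch below shows one common approach in Python. It illustrates the technique and is not Canon's code: exclusive-create mode for copies, and a hard link followed by unlink as a rename that refuses to overwrite. Function names are hypothetical.

```python
import errno
import os
import shutil


def copy_noclobber(src, dst):
    """Copy src to dst, raising FileExistsError if dst already exists."""
    # Mode "xb" creates the file exclusively, so an existing dst is never touched.
    with open(src, "rb") as fin, open(dst, "xb") as fout:
        shutil.copyfileobj(fin, fout)
    shutil.copystat(src, dst)  # carry over mtime and permission bits


def move_noclobber(src, dst):
    """Relocate src to dst without overwriting; copy+delete across devices."""
    try:
        # A hard link fails if dst exists (EEXIST) or is on another device (EXDEV).
        os.link(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        copy_noclobber(src, dst)  # cross-device fallback
    os.unlink(src)  # remove the original only after dst is in place
```

Plain os.rename would silently replace an existing destination on Unix, which is why the sketch links first and unlinks afterwards.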
For --rename and --move, the confirmation summary shows which source roots will lose files:
Mode: rename (sources will be relocated)
Files: 150
Sources from:
/Volumes/Drive1 (100 files)
/Volumes/Drive2 (50 files)
Resume mode (--resume):
Use --resume to continue a previously interrupted apply. This is useful when:
- Apply was interrupted (Ctrl+C, system crash, disk full)
- Some files failed to transfer due to errors
Resume mode classifies each destination into one of:
- Already archived - Registered in database, skipped
- Resumed - File exists on disk but not in database, skipped (needs scan to register)
- To transfer - Not in database, not on disk, will be copied
# Resume an interrupted apply
canon apply manifest.toml --resume
# Preview what --resume would do
canon apply manifest.toml --resume --dry-run
If --resume reports “resumed” files, run canon scan on the affected paths to register them:
# Scan only the destination directory that was being written to
canon scan /path/to/archive/2024
If --resume detects files with size mismatches (partial copies from interrupted transfers), it will error and ask you to delete those files before continuing.
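The resume rules amount to a small decision table. A minimal sketch, assuming the caller already knows whether the destination is registered and what size (if any) is on disk. The function name and parameters are hypothetical.

```python
def classify_destination(in_database: bool, on_disk_size, expected_size: int) -> str:
    """Classify one destination for --resume.

    on_disk_size is None when the destination file does not exist.
    """
    if in_database:
        return "already archived"  # registered: skip
    if on_disk_size is None:
        return "to transfer"       # nothing there yet: copy it
    if on_disk_size != expected_size:
        # Likely a partial copy from an interrupted transfer.
        raise ValueError("size mismatch: delete the partial file before resuming")
    return "resumed"               # on disk but unregistered: skip, scan later


print(classify_destination(False, None, 4096))
print(classify_destination(False, 4096, 4096))
```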
Integrity validation:
During transfer, Canon validates each source file’s partial hash (first 8KB + last 8KB) to detect file corruption or modification since the manifest was generated. If validation fails, the transfer is aborted.
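A partial hash of this kind can be computed cheaply because it never reads the middle of the file. The sketch below shows the general idea. The choice of SHA-256 and the handling of files smaller than 16KB are assumptions, since the text above only specifies the first and last 8KB.

```python
import hashlib
import os

CHUNK = 8 * 1024  # 8KB


def partial_hash(path: str) -> str:
    """Hash the first 8KB and last 8KB of a file (SHA-256 assumed)."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        digest.update(f.read(CHUNK))
        if size > CHUNK:
            # For files under 16KB, start the tail where the head ended
            # so no byte is hashed twice (an assumption of this sketch).
            f.seek(max(size - CHUNK, CHUNK))
            digest.update(f.read(CHUNK))
    return digest.hexdigest()
```

The trade-off is speed over completeness: a change confined to the middle of a large file leaves the partial hash unchanged, so the check catches truncation and many edits but is not a full verification.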
Root filtering:
Use --root to apply only a subset of sources from the manifest. Useful for staged application when sources are on different drives.
- --root id:N - Filter by root ID (shown in manifest as root_id)
- --root path:/path - Filter by root path (must match exactly)
Pre-flight checks (mandatory):
- Destination collisions - If multiple sources would map to the same destination path (e.g., using {filename} when sources have duplicate names), apply aborts with an error showing which files conflict.
- Destination path conflicts - In regular mode (without --resume), checks if any destination paths are already occupied, either registered in the database or existing on disk. If conflicts are found, apply suggests using --resume to skip already-copied files.
- Stale destination records - If the database shows files as present in the archive but they’re missing from disk, apply aborts. Run canon scan <archive> to update the database before retrying.
- Archive conflicts - Checks if files already exist in the destination archive or other archives.
- Excluded sources - Blocks if any sources in the manifest are marked as excluded.
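The destination-collision check is the one most often triggered by a too-simple pattern. As an illustration (hypothetical names, not Canon's implementation), detecting it comes down to grouping planned destinations and reporting any that are claimed more than once:

```python
from collections import defaultdict


def find_collisions(planned: dict[str, str]) -> dict[str, list[str]]:
    """planned maps source path -> destination path.

    Returns {destination: [sources]} for every destination claimed
    by more than one source.
    """
    by_dest = defaultdict(list)
    for src, dest in planned.items():
        by_dest[dest].append(src)
    return {dest: sorted(srcs) for dest, srcs in by_dest.items() if len(srcs) > 1}


print(find_collisions({
    "/Volumes/Drive1/a/IMG_0001.jpg": "/Volumes/Archive/IMG_0001.jpg",
    "/Volumes/Drive2/b/IMG_0001.jpg": "/Volumes/Archive/IMG_0001.jpg",
    "/Volumes/Drive1/a/IMG_0002.jpg": "/Volumes/Archive/IMG_0002.jpg",
}))
```

Adding a distinguishing component to the pattern, such as {hash_short}, is the usual way to resolve such collisions.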
Edit the manifest’s [output] section to customize the destination:
[output]
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}"
base_dir = "/path/to/archive"
Pattern variables use fact keys with optional modifiers (see Pattern Expressions for the full syntax):
- {filename}, {stem}, {ext} - Filename aliases
- {hash}, {hash_short} - Content hash aliases
- {source.mtime|year}, {source.mtime|month} - File modification date
- {content.DateTimeOriginal|year} - EXIF date with modifier
- {content.Make}, {content.Model} - Any fact key
Recovering from interrupted apply:
If apply is interrupted or encounters errors:
- Fix any reported errors (permissions, disk space, etc.)
- Delete any partial files in the archive (files with wrong sizes from interrupted copies)
- Re-run with --resume:
canon apply manifest.toml --resume
The --resume flag skips files that already exist and transfers only the remaining files. It will detect and report partial files that need deletion.
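The decision --resume makes for each file can be pictured as a three-way classification. The sketch below is a guess at the logic based on the description above, comparing sizes only; the function name and return labels are hypothetical.

```python
import os


def classify_destination(src: str, dest: str) -> str:
    """Classify one planned transfer as 'copy', 'skip', or 'partial'.

    Assumption: a destination with the same size as its source counts as
    already copied; a different size indicates an interrupted copy.
    """
    if not os.path.exists(dest):
        return "copy"      # nothing there yet: transfer it
    if os.path.getsize(dest) == os.path.getsize(src):
        return "skip"      # already copied: report as resumed
    return "partial"       # size mismatch: must be deleted by the user
```

Under this reading, any 'partial' result stops the run, which matches the instruction to delete partial files before retrying.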
If --resume reports “resumed” files, scan the destination to register them:
canon scan /path/to/archive/destination-folder
If source files changed during apply, refresh the manifest first:
canon scan <source-paths>
canon cluster refresh manifest.toml
canon apply manifest.toml
Maintenance
Commands for cleaning up and maintaining Canon’s database.
These operations delete data from the database (never from disk). All are dry-run by default — use --yes to execute.
- facts delete - Remove incorrect or unwanted metadata
- prune - Clean up stale, orphaned, or excluded data
canon facts delete
Delete facts by key. Useful for removing incorrect or unwanted metadata.
# Preview deletion (dry-run by default)
canon facts delete content.mime --on object
canon facts delete content.Make --on source /path/to/photos --where 'source.ext=jpg'
# Execute deletion
canon facts delete content.mime --on object --yes
- --on source or --on object is required to specify entity type
- Protected namespaces (source.*) cannot be deleted
- Dry-run by default; use --yes to execute
canon prune
Clean up orphaned or stale data from the database.
# Preview stale facts (file changed since fact was recorded)
canon prune --stale-facts
# Preview orphaned objects (no present sources reference them)
canon prune --orphaned-objects
# Preview facts for excluded sources/objects
canon prune --excluded-facts
canon prune --excluded-facts=source # Only source facts
canon prune --excluded-facts=object # Only object facts
# Execute deletion
canon prune --stale-facts --yes
canon prune --orphaned-objects --yes
canon prune --excluded-facts --yes
Stale facts are those where observed_basis_rev no longer matches the source’s current basis_rev (meaning the file was modified after the fact was imported).
Orphaned objects are content entries with no remaining present sources. This can happen when files are deleted. You may want to keep them as a historical record, or delete them to clean up the database.
Excluded facts are metadata for sources or objects you’ve marked as excluded. Since you’ve decided not to archive them, you may want to remove their facts to free up database space.
All prune operations are dry-run by default. Add --yes to execute.
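The stale-fact rule described above amounts to a revision comparison. The sketch below illustrates it with a hypothetical `Fact` record; the field names observed_basis_rev and basis_rev come from the text, while the integer revision type and everything else are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str
    observed_basis_rev: int  # source revision when the fact was imported


def stale_facts(facts: list[Fact], basis_rev: int) -> list[Fact]:
    """Facts whose observed revision no longer matches the source's current one."""
    return [f for f in facts if f.observed_basis_rev != basis_rev]


facts = [Fact("content.mime", 1), Fact("content.Make", 2)]
# The source is now at revision 2, so the fact recorded at revision 1 is stale.
print([f.key for f in stale_facts(facts, basis_rev=2)])
```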
Facts Reference
Facts are key-value metadata. See Concepts: Facts for an overview.
Namespaces
| Namespace | Description |
|---|---|
| source.* | Facts about the file on disk (path, size, mtime) |
| content.* | Facts about the content (hash, EXIF, mime type) |
| object.* | Object-level properties |
The content. prefix is optional when querying. For example, Make=Apple is equivalent to content.Make=Apple.
Values
Facts can hold three value types:
| Type | Examples | Notes |
|---|---|---|
| Text | "Apple", "image/jpeg" | Strings; quote if contains spaces |
| Number | 1024, 3.14, -5 | Integers or decimals |
| Timestamp | 1704067200 | Unix timestamps; enable date modifiers |
Modifiers
Transform values using | syntax:
Time Modifiers
For timestamp values (like source.mtime or EXIF dates):
| Modifier | Output | Example |
|---|---|---|
| year | 4-digit year | 2024 |
| month | 2-digit month | 07 |
| day | 2-digit day | 23 |
| hour | 2-digit hour (24h) | 14 |
| minute | 2-digit minute | 30 |
| second | 2-digit second | 45 |
| date | ISO date | 2024-07-23 |
| time | ISO time | 14:30:45 |
| datetime | ISO datetime | 2024-07-23T14:30:45 |
| yearmonth | Year-month | 2024-07 |
| week | ISO week number | 30 |
| weekday | Day of week (Mon=1) | 2 |
| quarter | Quarter (1-4) | 3 |
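To make the table concrete, the following Python sketch reproduces each modifier's output for a Unix timestamp using the standard library. It is an approximation for illustration: interpreting timestamps as UTC and leaving the week number unpadded are assumptions, and `time_modifier` is a hypothetical name.

```python
from datetime import datetime, timezone


def time_modifier(ts: int, modifier: str) -> str:
    """Format a Unix timestamp the way the table above describes."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)  # assumption: UTC
    outputs = {
        "year": f"{dt.year:04d}",
        "month": f"{dt.month:02d}",
        "day": f"{dt.day:02d}",
        "hour": f"{dt.hour:02d}",
        "minute": f"{dt.minute:02d}",
        "second": f"{dt.second:02d}",
        "date": dt.strftime("%Y-%m-%d"),
        "time": dt.strftime("%H:%M:%S"),
        "datetime": dt.strftime("%Y-%m-%dT%H:%M:%S"),
        "yearmonth": dt.strftime("%Y-%m"),
        "week": str(dt.isocalendar()[1]),         # ISO week number
        "weekday": str(dt.isoweekday()),          # Monday = 1
        "quarter": str((dt.month - 1) // 3 + 1),
    }
    return outputs[modifier]


ts = int(datetime(2024, 7, 23, 14, 30, 45, tzinfo=timezone.utc).timestamp())
print(time_modifier(ts, "yearmonth"))  # 2024-07
print(time_modifier(ts, "week"))       # 30
print(time_modifier(ts, "weekday"))    # 2
print(time_modifier(ts, "quarter"))    # 3
```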
String Modifiers
| Modifier | Description | Example |
|---|---|---|
| lowercase | Convert to lowercase | JPG → jpg |
| uppercase | Convert to uppercase | jpg → JPG |
| capitalize | Capitalize first letter | apple → Apple |
| stem | Filename without extension | photo.jpg → photo |
| ext | File extension | photo.jpg → jpg |
| short | First 8 characters | abc123def456 → abc123de |
Numeric Modifiers
| Modifier | Description |
|---|---|
| bucket | Group into ranges (1-10, 10-100, etc.) |
| bucket(a,b,c) | Custom ranges (<a, a-b, b-c, >c) |
Example: source.size|bucket groups file sizes into human-readable ranges.
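The two bucket forms can be sketched as follows. This is an illustrative guess at the behavior, not Canon's code: the label format, the powers-of-ten default, and the choice to place a value equal to an edge in the upper range are all assumptions, and both function names are hypothetical.

```python
import bisect


def bucket_custom(value: float, edges: list[float]) -> str:
    """Label `value` with its range among sorted edges: <a, a-b, ..., >c."""
    edges = sorted(edges)
    i = bisect.bisect_right(edges, value)  # value == edge goes to upper range
    if i == 0:
        return f"<{edges[0]}"
    if i == len(edges):
        return f">{edges[-1]}"
    return f"{edges[i - 1]}-{edges[i]}"


def bucket_decade(value: int) -> str:
    """Default grouping by powers of ten: 1-10, 10-100, and so on."""
    if value < 1:
        return "<1"
    low = 10 ** (len(str(int(value))) - 1)  # largest power of ten <= value
    return f"{low}-{low * 10}"


print(bucket_decade(1024))                       # 1000-10000
print(bucket_custom(5000, [1000, 1_000_000]))    # 1000-1000000
```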
Path Accessors
Python-style indexing for path values:
| Syntax | Meaning |
|---|---|
| key[-1] | Last segment (filename) |
| key[0] | First segment |
| key[1:3] | Slice segments 1 and 2 |
| key[:-1] | All but last segment |
Accessors can be combined with modifiers:
source.rel_path[-1] → IMG_001.jpg
source.rel_path[-1]|stem → IMG_001
source.rel_path[0] → photos
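Because the accessors follow Python's indexing rules, their behavior can be checked directly in Python by splitting a path into segments. The helper names below are hypothetical, and the example path is chosen for illustration.

```python
import posixpath


def access(path: str, index) -> str:
    """Apply a Python index or slice to the '/'-separated segments of a path."""
    picked = path.split("/")[index]
    # A slice yields a list of segments; join them back into a path.
    return "/".join(picked) if isinstance(picked, list) else picked


def stem(name: str) -> str:
    """Filename without its extension, like the stem modifier."""
    return posixpath.splitext(name)[0]


rel = "photos/2024/vacation/IMG_001.jpg"
print(access(rel, -1))               # IMG_001.jpg
print(access(rel, 0))                # photos
print(access(rel, slice(1, 3)))      # 2024/vacation
print(access(rel, slice(None, -1)))  # photos/2024/vacation
print(stem(access(rel, -1)))         # IMG_001
```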
See Also
- Built-in Facts - Complete list of automatic facts
- Filters - Using facts in queries
- Pattern Expressions - Using facts in archive patterns
Built-in Facts Reference
These facts are automatically available for all sources without enrichment.
Source Facts
| Fact | Type | Description |
|---|---|---|
| source.id | num | Database ID (hidden*) |
| source.ext | text | File extension (lowercase, no dot) |
| source.size | num | File size in bytes |
| source.mtime | time | Modification timestamp |
| source.path | path | Full absolute path |
| source.root | path | Root directory path (hidden) |
| source.rel_path | path | Path relative to root (hidden) |
| source.device | num | Device ID (hidden) |
| source.inode | num | Inode number (hidden) |
Content Facts
| Fact | Type | Description |
|---|---|---|
| content.hash.sha256 | text | SHA-256 content hash |
Pattern Aliases
These aliases are available in pattern expressions:
| Alias | Expands To |
|---|---|
| filename | source.rel_path[-1] |
| stem | source.rel_path[-1]\|stem |
| ext | source.rel_path[-1]\|ext |
| hash | content.hash.sha256 |
| hash_short | content.hash.sha256\|short |
| id | source.id |
*Hidden facts are not shown in canon facts by default. Use --all to include them.
Filter Syntax
Filters select sources based on facts using a boolean expression language. Most commands accept --where to filter which sources they operate on. Multiple --where flags are combined with AND.
Operators
Basic
| Syntax | Meaning |
|---|---|
| key? | Fact exists |
| key=value | Fact equals value (case-sensitive) |
| key!=value | Fact doesn’t equal value (case-sensitive) |
| key~pattern | Glob pattern match (case-sensitive) |
| key!~pattern | Glob pattern doesn’t match |
| key>value | Greater than (numbers/dates) |
| key>=value | Greater or equal |
| key<value | Less than |
| key<=value | Less or equal |
| key IN (v1, v2, ...) | Fact matches any value in list |
| key NOT IN (v1, v2, ...) | Fact doesn’t match any value in list |
Glob Patterns
The ~ operator supports shell-style glob patterns:
| Pattern | Meaning |
|---|---|
| * | Match zero or more characters |
| ? | Match exactly one character |
| [abc] | Match any character in set |
| [a-z] | Match character range |
| [!abc] | Match any character NOT in set |
| \* | Literal asterisk (escape) |
# Files starting with IMG_
--where 'filename~IMG_*'
# Files with 3-letter extension
--where 'source.ext~???'
# Files in a year subdirectory
--where 'source.rel_path~*/2024/*'
# Exclude temp files
--where 'filename!~*.tmp'
Values after operators like ~, =, != accept most characters without quoting — including /, -, ?, *, [, ]. Quoting (single or double) is still supported for values containing spaces or parentheses.
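For experimenting with glob semantics outside Canon, Python's standard `fnmatch.fnmatchcase` implements a close, case-sensitive analogue of the patterns above. It is not Canon's matcher, and it differs in one respect: it has no backslash escape, so a literal asterisk is written `[*]` rather than `\*`. The wrapper name `glob_match` is hypothetical.

```python
from fnmatch import fnmatchcase


def glob_match(value: str, pattern: str) -> bool:
    """Case-sensitive shell-style match supporting *, ?, [abc], [a-z], [!abc]."""
    return fnmatchcase(value, pattern)


print(glob_match("IMG_001.jpg", "IMG_*"))             # True
print(glob_match("img_001.jpg", "IMG_*"))             # False (case-sensitive)
print(glob_match("jpg", "???"))                       # True
print(glob_match("photos/2024/a.jpg", "*/2024/*"))    # True
print(not glob_match("scratch.tmp", "*.tmp"))         # False, so !~ excludes it
```

Note that `*` in fnmatch also matches `/`, which is what the `*/2024/*` example relies on.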
Boolean Operators
| Syntax | Meaning |
|---|---|
| expr AND expr | Both conditions must match |
| expr OR expr | Either condition matches |
| NOT expr | Negates the condition |
| (expr) | Grouping for precedence |
Operator precedence (highest to lowest): NOT, AND, OR. Use parentheses to override.
Aliases
You can define named aliases in $CANON_HOME/aliases.toml (by default ~/.canon/aliases.toml). There are two kinds of aliases, and Canon classifies them automatically — just define the value and use it:
Expression Aliases
Shorthand for complete filter predicates. These are values that contain an operator (like =, >, IN, etc.):
image = "content.mime IN ('image/jpeg', 'image/png', 'image/gif', 'image/tiff', 'image/webp', 'image/heic')"
video = "content.mime IN ('video/mp4', 'video/quicktime', 'video/x-msvideo', 'video/x-matroska')"
tens = "source.mtime|year >= 2010 AND source.mtime|year < 2020"
large = "source.size > 10000000"
Expression aliases are wrapped in parentheses when expanded, so boolean logic inside them composes safely:
canon ls --where '@image AND @tens'
# Expands to: (content.mime IN (...)) AND (source.mtime|year >= 2010 AND source.mtime|year < 2020)
Key Aliases
Shorthand for verbose key paths — accessors, modifiers, and namespaces. These are values that are just a key (no operator):
filename = "source.rel_path[-1]"
parent = "source.rel_path[-2]"
ext = "source.ext|lowercase"
year = "source.mtime|year"
taken = "content.DateTimeOriginal"
yearmonth = "content.DateTimeOriginal|yearmonth"
Key aliases are substituted literally and used with operators in your filter:
canon ls --where '@filename = "photo.jpg"'
# Expands to: source.rel_path[-1] = "photo.jpg"
canon ls --where '@yearmonth >= 202301'
# Expands to: content.DateTimeOriginal|yearmonth >= 202301
canon ls --where '@ext = "jpg" AND @year >= 2020'
# Expands to: source.ext|lowercase = "jpg" AND source.mtime|year >= 2020
Using Aliases
Reference aliases with @name in any --where expression:
# Expression alias standalone
canon ls --where '@image'
# Compose expression aliases
canon ls --where '@image OR @video'
# Key alias with operator
canon ls --where '@filename ~ "IMG_*"'
# Mix both kinds
canon ls --where '@image AND @year >= 2020'
# Negate an expression alias
canon ls --where 'NOT @large'
How Classification Works
Canon automatically determines whether each alias is a key or an expression by parsing the value. If the value is a valid filter expression (contains an operator), it’s an expression alias and gets wrapped in parentheses. If not (it’s just a key path), it’s a key alias and gets substituted literally. You don’t need to think about this — just define your alias and use it.
Rules:
- Alias names must start with a letter and can contain letters, digits, underscores, and hyphens
- @ inside quoted strings is treated as a literal character, not an alias reference
- Nested aliases are not supported (@ in alias values is literal)
- The aliases file is only loaded when @ appears in a --where argument
- If the file doesn’t exist and no @ aliases are used, no error is raised
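The classification and expansion rules above can be approximated in a few lines of Python. This is a simplified sketch, not Canon's parser: it detects an expression alias with a regular expression for operators instead of actually parsing the value, and the `expand` function is a hypothetical name.

```python
import re

# Assumption: a value is an expression alias if it contains a comparison
# operator, the existence operator, or the IN keyword. The modifier
# separator '|' deliberately does not count as an operator.
_OPERATOR = re.compile(r"!=|>=|<=|!~|=|>|<|~|\?|\bIN\b")

# Match quoted strings first so that '@' inside them is left alone.
_TOKEN = re.compile(r"""'[^']*'|"[^"]*"|@([A-Za-z][A-Za-z0-9_-]*)""")


def expand(expr: str, aliases: dict[str, str]) -> str:
    """Expand @name references; expression aliases get parentheses."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name is None:              # a quoted string: keep it verbatim
            return match.group(0)
        value = aliases[name]         # KeyError for an undefined alias
        return f"({value})" if _OPERATOR.search(value) else value

    # re.sub does not rescan substituted text, so aliases never nest.
    return _TOKEN.sub(replace, expr)


aliases = {
    "large": "source.size > 10000000",
    "year": "source.mtime|year",
    "filename": "source.rel_path[-1]",
}
print(expand("NOT @large", aliases))
print(expand("@year >= 2020", aliases))
print(expand('@filename = "a@b.jpg"', aliases))
```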
Using Modifiers
Modifiers can be applied to fact keys using the | syntax. See Facts for the complete list.
# Files from 2024
--where 'source.mtime|year=2024'
# January photos
--where 'content.DateTimeOriginal|month=1'
# Case-insensitive extension matching
--where 'source.ext|lowercase=jpg'
# Case-insensitive glob
--where 'filename|lowercase~img_*'
Examples
# Files with a content hash
--where 'content.hash.sha256?'
# Files missing a content hash
--where 'NOT content.hash.sha256?'
# JPG files only
--where 'source.ext=jpg'
# JPG or PNG files
--where 'source.ext=jpg OR source.ext=png'
# Common image formats
--where 'source.ext IN (jpg, png, gif, webp)'
# Exclude certain extensions
--where 'source.ext NOT IN (tmp, bak, log)'
# Not temporary files
--where 'NOT source.ext=tmp'
# iPhone photos (content. prefix is optional)
--where 'Make=Apple'
# Files larger than 1MB
--where 'source.size>1000000'
# Files modified in 2024 or later
--where 'source.mtime>=2024-01-01'
# Large images (combining with parentheses)
--where '(source.ext=jpg OR source.ext=png) AND source.size>1000000'
# Multiple --where flags combine with AND
--where 'source.ext=jpg' --where 'content.Make=Apple'
Pattern Expressions
Pattern expressions define how files are organized in archives. They use {expr} syntax to insert dynamic values based on facts.
Patterns are used in the pattern field of cluster manifests. When you run canon cluster generate, it creates a manifest with a default pattern = "{filename}" that you can customize.
Basic Syntax
Patterns consist of literal path segments and expressions in curly braces:
{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}
This would produce paths like: 2024/07/IMG_001.jpg
Fact Keys
Any fact key can be used in a pattern:
- {source.ext} - File extension
- {source.mtime} - Modification time
- {content.Make} - Camera manufacturer (from EXIF)
- {content.hash.sha256} - Content hash
The content. prefix is optional for content facts, so {Make} is equivalent to {content.Make}.
Modifiers
Transform values using the | syntax. See Facts for the complete list.
{source.mtime|year} → 2024
{source.mtime|yearmonth} → 2024-07
{content.hash.sha256|short} → a1b2c3d4
{source.ext|uppercase} → JPG
Multiple modifiers can be chained:
{filename|stem|lowercase} → img_001
Path Accessors
Extract segments from path values using Python-style indexing:
| Syntax | Meaning |
|---|---|
| key[-1] | Last segment (filename) |
| key[0] | First segment |
| key[1:3] | Slice segments 1 and 2 |
| key[:-1] | All but last segment |
Examples with source.rel_path = "photos/2024/vacation/IMG_001.jpg":
{source.rel_path[-1]} → IMG_001.jpg
{source.rel_path[0]} → photos
{source.rel_path[1:-1]} → 2024/vacation
{source.rel_path[-1]|stem} → IMG_001
Aliases
Aliases provide shorthand for common expressions. Use canon facts --show-aliases to see all available aliases.
| Alias | Expands To |
|---|---|
| filename | source.rel_path[-1] |
| stem | source.rel_path[-1]\|stem |
| ext | source.rel_path[-1]\|ext |
| hash | content.hash.sha256 |
| hash_short | content.hash.sha256\|short |
| id | source.id |
Example using aliases:
{hash_short}_{filename} → a1b2c3d4_IMG_001.jpg
Missing Values
Canon requires all facts used in a pattern to have values for every source. If any source is missing a required fact, canon apply will refuse to proceed and report which facts are missing.
When you run canon cluster generate, the manifest includes comments listing all facts with 100% coverage—these are safe to use in your pattern.
If sources are missing required facts, you can:
- Filter them out during generation: --where 'DateTimeOriginal?'
- Import the missing facts via the enrichment pipeline
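The coverage requirement described above can be pictured as a check that runs before any path is rendered. The sketch below is a simplified illustration with hypothetical names: it strips modifiers from each {expr}, treats each source as a plain dictionary of facts, and does not handle aliases, path accessors, or the optional content. prefix.

```python
import re

_EXPR = re.compile(r"\{([^{}]+)\}")


def pattern_keys(pattern: str) -> list[str]:
    """Fact keys referenced by a pattern, with modifiers stripped."""
    return [expr.split("|")[0].strip() for expr in _EXPR.findall(pattern)]


def missing_facts(pattern: str, sources: list[dict]) -> dict[str, int]:
    """Count how many sources lack each fact key the pattern needs."""
    counts: dict[str, int] = {}
    for key in pattern_keys(pattern):
        lacking = sum(1 for facts in sources if key not in facts)
        if lacking:
            counts[key] = lacking
    return counts


pattern = "{content.DateTimeOriginal|year}/{content.Make}"
sources = [
    {"content.DateTimeOriginal": 1721745045, "content.Make": "Apple"},
    {"content.Make": "Canon"},  # no capture date recorded
]
print(missing_facts(pattern, sources))  # {'content.DateTimeOriginal': 1}
```

An empty result corresponds to the 100% coverage case, where the pattern is safe to apply to every source.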
Common Patterns
# Flat (all files in one directory)
pattern = "{filename}"
# Preserve original structure
pattern = "{source.rel_path}"
# By EXIF capture date
pattern = "{content.DateTimeOriginal|year}/{content.DateTimeOriginal|month}/{filename}"
# By date with hash prefix (collision-safe)
pattern = "{content.DateTimeOriginal|date}/{hash_short}_{filename}"
# By camera
pattern = "{content.Make}/{content.Model}/{filename}"
# By file type and year
pattern = "{source.ext}/{source.mtime|year}/{filename}"