Writing Processors
Processors are scripts or programs that read worklist entries, extract metadata from files, and output facts for import.
Input and Output
A processor reads JSONL from worklist and writes JSONL for import-facts.
Input (from worklist):
{"source_id":123,"path":"/photos/IMG_001.jpg","basis_rev":0,"size":1024,"mtime":1703980800}
Output (for import-facts):
{"source_id":123,"basis_rev":0,"facts":{"Make":"Apple","Model":"iPhone 12"}}
The processor must pass through source_id and basis_rev unchanged.
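While developing a processor, it can help to check the shape of its output before piping it into import-facts. The sketch below assumes jq is installed; check_output is a hypothetical helper name, and it only checks the three fields shown above, which may be less than import-facts itself enforces.

```shell
#!/bin/sh
# check_output (hypothetical helper): read processor output (JSONL) on stdin.
# Succeeds only if every line is an object that has source_id and basis_rev
# and whose "facts" value is a JSON object.
check_output() {
    # -s slurps all lines into one array; -e turns a false result into exit status 1.
    jq -e -s '
        all(.[];
            type == "object"
            and has("source_id") and has("basis_rev")
            and (.facts | type) == "object")' >/dev/null
}
```

One way to use it is to run the processor on a small worklist and pipe the result through check_output; a non-zero exit status means at least one line is malformed.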
Custom Processors
Read JSONL from stdin, extract facts from each file, output JSONL to stdout:
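#!/bin/bash
# Read one worklist entry (a JSON object) per line; also handle a final line without a newline.
while IFS= read -r line || [ -n "$line" ]; do
  # Keep source_id and basis_rev as raw JSON so they pass through unchanged.
  source_id=$(echo "$line" | jq -c '.source_id')
  basis_rev=$(echo "$line" | jq -c '.basis_rev')
  path=$(echo "$line" | jq -r '.path')
  # Extract facts (example: EXIF data); exiftool -json prints a one-element array.
  facts=$(exiftool -json -Make -Model "$path" 2>/dev/null | jq -c '.[0]')
  # Skip files exiftool could not read instead of emitting a broken line.
  if [ -z "$facts" ] || [ "$facts" = "null" ]; then
    continue
  fi
  jq -nc \
    --argjson source_id "$source_id" \
    --argjson basis_rev "$basis_rev" \
    --argjson facts "$facts" \
    '{source_id: $source_id, basis_rev: $basis_rev, facts: $facts}'
done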
The canonargs Helper
If you don’t want to handle JSONL parsing and output formatting yourself, canonargs takes care of that. You only provide a command that extracts data from a single file.
Installation
cargo install canonargs
Single Fact Mode
When your command outputs a single value:
canon worklist | canonargs --fact mime -- file -b --mime-type {} | canon import-facts
The {} is replaced with the file path. The command’s stdout becomes the fact value.
Default behavior: Values are stored as text. To specify a type, add --type:
# Store as datetime (enables |year, |month modifiers)
canon worklist | canonargs --fact DateTimeOriginal --type datetime -- exiftool -DateTimeOriginal -s3 {} | canon import-facts
# Store image width as number (using ImageMagick's identify)
canon worklist | canonargs --fact width --type number -- identify -format '%w' {} | canon import-facts
Valid types: datetime, duration, number
Key-Value Mode
When your command outputs key=value pairs (one per line):
canon worklist | canonargs --kv -- my-extractor {} | canon import-facts
Default behavior: All values are stored as text. To specify types, use key:type=value syntax:
width:number=1920
height:number=1080
DateTimeOriginal:datetime=2024:07:23 14:30:00
codec=h264
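Output like the lines above could come from a small extractor script. The following is a minimal sketch of a hypothetical my-extractor that uses only standard shell tools; the fact names bytes and extension are illustrative, and how canonargs treats a non-zero exit status is not covered here.

```shell
#!/bin/sh
# Hypothetical my-extractor: print key=value facts for a single file.
# Usage: my-extractor PATH
extract_facts() {
    path=$1
    # Report failure and print nothing for unreadable files.
    [ -r "$path" ] || return 1
    # wc -c counts bytes; tr strips the padding some wc implementations add.
    size=$(wc -c < "$path" | tr -d ' ')
    # key:type=value form, so the value is stored as a number rather than text.
    printf 'bytes:number=%s\n' "$size"
    # Plain key=value form, stored as text: the part of the file name after the last dot.
    name=${path##*/}
    case $name in
        *.*) printf 'extension=%s\n' "${name##*.}" ;;
    esac
}

# Run the extractor only when the script is given a file argument.
if [ "$#" -gt 0 ]; then
    extract_facts "$1"
fi
```

For a five-byte file named a.txt this prints bytes:number=5 followed by extension=txt.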
JSON Mode
When your command outputs a JSON object:
canon worklist | canonargs --json -- exiftool -json {} | canon import-facts
Example extractor output:
{"Make": "Apple", "Model": "iPhone 12", "DateTimeOriginal": "2024:07:23 14:30:00"}
JSON mode auto-detects numbers. If your command outputs "width": 1920 (a JSON number), it’s stored as a number. If it outputs "width": "1920" (a quoted string), it’s stored as text.
For datetime fields, you still need to use the typed hint format:
{"DateTimeOriginal": {"value": "2024:07:23 14:30:00", "type": "datetime"}}
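If an extractor emits plain date strings, one option is to rewrite its JSON before it reaches canonargs. The sketch below assumes jq is installed; wrap_datetime is a hypothetical helper name, and it expects a single JSON object per input, so array output such as that of exiftool -json would first need to be reduced to one object.

```shell
#!/bin/sh
# wrap_datetime KEY (hypothetical helper): read JSON objects on stdin and
# replace the string stored under KEY with a datetime typed hint.
# Objects that lack KEY, or hold a non-string there, pass through unchanged.
wrap_datetime() {
    jq -c --arg key "$1" '
        if has($key) and (.[$key] | type) == "string"
        then .[$key] = {"value": .[$key], "type": "datetime"}
        else .
        end'
}
```

A wrapper script could pipe the extractor's output through wrap_datetime DateTimeOriginal and be used as the command that canonargs --json runs.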
Chaining
Processors can be chained since canonargs passes through the worklist entry and merges facts:
canon worklist \
| canonargs --fact mime -- file -b --mime-type {} \
| canonargs --json -- exiftool -json {} \
| canon import-facts
Using Existing Facts
Processors can access previously imported facts via the --emit flag on worklist. See Emitting Existing Facts for details.
Type Hints
Important: The type of a fact determines what operations work on it:
- Timestamps enable |year, |month modifiers and date comparisons (>=2024-01-01)
- Numbers enable numeric comparisons (>1000) and the |bucket modifier
- Text enables string matching and |lowercase, |stem modifiers
If your processor outputs dates as strings or numbers as strings, add type hints:
{"source_id":123,"basis_rev":0,"facts":{
"DateTimeOriginal": {"value": "2024:07:23 11:06:32", "type": "datetime"},
"duration": {"value": "1:23:45", "type": "duration"},
"width": 1920
}}
Without "type": "datetime", a date string like "2024:07:23 11:06:32" is stored as text and --where 'DateTimeOriginal|year=2024' won’t work.
Numbers from JSON are automatically stored as numbers. But if your extractor outputs "width": "1920" (a string), numeric comparisons like --where 'width>1000' won’t work as expected.
See import-facts for full details.
Tips
- Always pass through source_id and basis_rev unchanged
- Use jq -c for compact JSON output (one object per line)
- Handle errors gracefully: skip files that can't be processed
- Use type hints for datetime fields so modifiers work correctly
- Ensure numbers are actual JSON numbers, not quoted strings