Source vs. Object
The relationship between sources and objects underpins how Canon handles deduplication and archive tracking.
Sources Are Locations
When a root is scanned, Canon indexes every file it finds as a source. Each source represents a specific file at a specific path.
Objects Are Content
When sources are hashed, Canon creates or links them to objects. An object represents the underlying content, independent of where it was found.
Source A: /backup1/photos/IMG_001.jpg ─┐
Source B: /backup2/old/IMG_001.jpg    ─┼─► Object (hash: abc123...)
Source C: /downloads/photo.jpg        ─┘
All three sources above have identical content, so they reference the same object.
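The many-to-one mapping in the diagram can be sketched in a few lines of Python. This is an illustration only, not Canon's implementation: the function names, the in-memory `sources` dictionary, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Return a hex digest for the content (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def group_sources_by_object(sources: dict) -> dict:
    """Map each content hash (the object) to the paths (the sources) that share it."""
    objects = defaultdict(list)
    for path, data in sources.items():
        objects[content_hash(data)].append(path)
    return dict(objects)


# Hypothetical file contents keyed by path; three paths hold identical bytes.
sources = {
    "/backup1/photos/IMG_001.jpg": b"same image bytes",
    "/backup2/old/IMG_001.jpg": b"same image bytes",
    "/downloads/photo.jpg": b"same image bytes",
    "/downloads/other.jpg": b"different bytes",
}

objects = group_sources_by_object(sources)
for digest, paths in objects.items():
    print(digest[:12], paths)
```

Four sources collapse into two objects here, because only the content determines the hash. The path and filename play no part.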
Fact Sharing
When a source is linked to an object:
- Content facts (like EXIF metadata) can be stored on the object and become available to all sources with that hash
- Source facts (like file path) remain specific to each source
This allows metadata to flow between different copies of the same content. Import a fact once, and it’s available everywhere that content exists.
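One way to picture this split is a source that holds its own facts and falls back to the facts of its linked object. The sketch below is hypothetical: the class names, the `all_facts` method, the fact keys, and the rule that source facts win on a key collision are assumptions for illustration, not Canon's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContentObject:
    """Content identified by hash; holds facts shared by every copy."""
    hash: str
    facts: Dict[str, str] = field(default_factory=dict)


@dataclass
class Source:
    """A file at a specific path; holds facts specific to that location."""
    path: str
    obj: Optional[ContentObject] = None
    facts: Dict[str, str] = field(default_factory=dict)

    def all_facts(self) -> Dict[str, str]:
        """Merge object facts with source facts (source facts win on collision)."""
        merged = dict(self.obj.facts) if self.obj else {}
        merged.update(self.facts)
        return merged


obj = ContentObject(hash="abc123")
a = Source("/backup1/photos/IMG_001.jpg", obj, {"path": "/backup1/photos/IMG_001.jpg"})
b = Source("/downloads/photo.jpg", obj, {"path": "/downloads/photo.jpg"})

# Record a content fact once, on the object (value is made up for the example).
obj.facts["exif.camera"] = "ExampleCam"

print(a.all_facts())
print(b.all_facts())
```

Both sources report the camera fact even though it was recorded once, while each keeps its own path.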
Archive Tracking
Canon uses the source-object relationship to track archiving progress:
- When you archive a file, Canon copies it to an archive root and records the object’s hash
- Any source with that same hash is now considered “archived”
- The coverage command shows how many of your sources exist in an archive
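The idea behind a coverage figure can be sketched as a set-membership check on hashes. The function below is a hypothetical illustration, not the behavior of Canon's coverage command. In particular, counting unhashed sources in the total as "not archived" is an assumption of this example.

```python
from typing import Dict, Optional, Set, Tuple


def coverage(source_hashes: Dict[str, Optional[str]],
             archived_hashes: Set[str]) -> Tuple[int, int]:
    """Return (archived_count, total_count) for the given sources.

    A source counts as archived when its hash appears in archived_hashes.
    Sources with no hash (None) cannot match and count as not archived.
    """
    archived = sum(
        1 for h in source_hashes.values()
        if h is not None and h in archived_hashes
    )
    return archived, len(source_hashes)


# Hypothetical index: path -> content hash (None means not yet hashed).
source_hashes = {
    "/backup1/photos/IMG_001.jpg": "abc123",
    "/backup2/old/IMG_001.jpg": "abc123",
    "/downloads/notes.txt": "def456",
    "/downloads/unscanned.bin": None,
}
archived_hashes = {"abc123"}  # hashes recorded when files were archived

archived, total = coverage(source_hashes, archived_hashes)
print(f"{archived} of {total} sources archived")
```

Archiving one copy of `IMG_001.jpg` marks both copies as archived, since they share a hash.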
Hashing
By default, Canon hashes all files during scanning. Since hashing can be time-consuming for large collections, you can:
- Use --no-hash during scan to skip hashing initially
- Hash selectively via the enrichment pipeline, targeting specific file types
Unhashed sources cannot be linked to objects, so they cannot be deduplicated or tracked for archive coverage.
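As a rough sketch of what selective hashing means, the example below hashes only files with chosen extensions and leaves the rest unhashed. It does not use Canon's enrichment pipeline: the function names, the extension filter, and SHA-256 are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Optional


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_selected(paths: Iterable[Path],
                  extensions: Iterable[str]) -> Dict[Path, Optional[str]]:
    """Hash only files whose extension is wanted; others map to None (unhashed)."""
    wanted = {ext.lower() for ext in extensions}
    return {
        p: hash_file(p) if p.suffix.lower() in wanted else None
        for p in paths
    }


# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.jpg").write_bytes(b"image bytes")
    (root / "b.txt").write_bytes(b"text bytes")
    result = hash_selected([root / "a.jpg", root / "b.txt"], {".jpg"})
    hashed = {p.name: h for p, h in result.items()}

print(hashed)
```

In this sketch the `.txt` file keeps a hash of `None`, so it could not be matched to an object until it is hashed later.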