---
name: file-deduplicator
description: Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.
metadata:
{
  "openclaw": {
    "version": "1.0.0",
    "author": "Vernox",
    "license": "MIT",
    "tags": ["deduplication", "storage", "cleanup", "file-management", "duplicate", "disk-space"],
    "category": "tools"
  }
}
---
# File-Deduplicator - Find and Remove Duplicates
**Vernox Utility Skill - Clean up your digital hoard.**
## Overview
File-Deduplicator is an intelligent duplicate-file finder and remover. It uses content hashing to identify identical files across directories, then offers options to remove the duplicates safely.
## Features
### ✅ Duplicate Detection
- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)
### ✅ Removal Options
- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)
### ✅ Analysis Tools
- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation
### ✅ Safety Features
- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (don't remove huge files by mistake)
- Whitelist important directories
- Undo functionality (log for recovery)
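These protections are designed to combine. A cautious first pass might archive rather than delete, cap removable file size, and preview before acting; here is a minimal sketch using `removeDuplicates` (documented below) with only its documented options (the directory paths are illustrative):
```javascript
const preview = await removeDuplicates({
  directories: ['./downloads'],
  options: {
    keep: 'newest',
    action: 'move',            // archive instead of delete
    archivePath: './archive',  // nothing is destroyed, so it stays recoverable
    sizeThreshold: 10485760,   // skip anything larger than 10 MB
    dryRun: true               // report what would happen, change nothing
  }
});
console.log(`Would archive ${preview.filesRemoved} files`);
```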
## Installation
```bash
clawhub install file-deduplicator
```
## Quick Start
### Find Duplicates in Directory
```javascript
const result = await findDuplicates({
  directories: ['./documents', './downloads', './projects'],
  options: {
    method: 'content', // content-based comparison
    includeSubdirs: true
  }
});
console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);
```
### Remove Duplicates Automatically
```javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',     // keep the newest copy in each group, remove the rest
    action: 'delete',   // or 'move' to archive
    autoConfirm: false  // prompt for confirmation on each group
  }
});
console.log(`Removed ${result.filesRemoved} duplicates`);
console.log(`Space saved: ${result.spaceSaved}`);
```
### Dry-Run Preview
```javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',
    action: 'delete',
    dryRun: true // preview without actual deletion
  }
});
console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
  console.log(`${i + 1}. ${dup.file}`);
});
```
## Tool Functions
### `findDuplicates`
Find duplicate files across directories.
**Parameters:**
- `directories` (array|string, required): Directory paths to scan
- `options` (object, optional):
- `method` (string): 'content' | 'size' | 'name' - comparison method
- `includeSubdirs` (boolean): Scan recursively (default: true)
- `minSize` (number): Minimum size in bytes (default: 0)
- `maxSize` (number): Maximum size in bytes; 0 means no upper limit (default: 0)
- `excludePatterns` (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
- `whitelist` (array): Directories to never scan (default: [])
**Returns:**
- `duplicates` (array): Array of duplicate groups
- `duplicateCount` (number): Number of duplicate groups found
- `totalFiles` (number): Total files scanned
- `scanDuration` (number): Time taken to scan (ms)
- `spaceWasted` (number): Total bytes wasted by duplicates
- `spaceSaved` (number): Potential savings if duplicates removed
### `removeDuplicates`
Remove duplicate files based on findings.
**Parameters:**
- `directories` (array|string, required): Same as findDuplicates
- `options` (object, optional):
- `keep` (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
- `action` (string): 'delete' | 'move' | 'archive'
- `archivePath` (string): Where to move files when action='move'
- `dryRun` (boolean): Preview without actual action
- `autoConfirm` (boolean): Auto-confirm deletions
- `sizeThreshold` (number): Don't remove files larger than this many bytes (default: 10485760 = 10 MB)
**Returns:**
- `filesRemoved` (number): Number of files removed/moved
- `spaceSaved` (number): Bytes saved
- `groupsProcessed` (number): Number of duplicate groups handled
- `logPath` (string): Path to action log
- `errors` (array): Any errors encountered
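It's worth checking the error list and keeping the action log after every run, since the log is what the undo feature relies on. A brief sketch using the return fields above:
```javascript
const result = await removeDuplicates({ directories: './downloads' });
if (result.errors.length > 0) {
  console.warn(`${result.errors.length} files were skipped due to errors`);
}
console.log(`Action log written to: ${result.logPath}`); // keep this for undo
```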
### `analyzeDirectory`
Analyze a single directory for duplicates.
**Parameters:**
- `directory` (string, required): Path to directory
- `options` (object, optional): Same as findDuplicates options
**Returns:**
- `fileCount` (number): Total files in directory
- `totalSize` (number): Total bytes in directory
- `duplicateSize` (number): Bytes in duplicate files
- `duplicateRatio` (number): Percentage of files that are duplicates
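`analyzeDirectory` has no example elsewhere in this document, so here is a minimal sketch built from the parameters and return fields above:
```javascript
const stats = await analyzeDirectory({
  directory: '~/Downloads',
  options: { method: 'content' }
});
console.log(`${stats.fileCount} files, ${stats.totalSize} bytes total`);
console.log(`${stats.duplicateRatio}% duplicates, wasting ${stats.duplicateSize} bytes`);
```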
## Use Cases
### Digital Hoarder Cleanup
- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders
### Document Management
- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat
### Project Cleanup
- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD
### Backup Optimization
- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives
## Configuration
### Edit `config.json`:
```json
{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,
    "nameSimilarity": 0.7,
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,
    "autoConfirm": false,
    "dryRunDefault": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}
```
Notes (kept out of the file itself, since strict JSON does not allow comments):
- `sizeTolerancePercent: 0` requires exact size matches.
- `nameSimilarity` is a 0-1 threshold; higher values require closer name matches.
- `sizeThreshold` is in bytes (10485760 = 10 MB).
## Methods
### Content-Based (Recommended)
- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives
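To make the idea concrete, here is a minimal, self-contained sketch of content-based grouping using Node's built-in `crypto` and `fs` modules. `hashFile` and `groupByContent` are illustrative names, not part of this skill's API:
```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hash a file's contents; identical files yield identical digests,
// regardless of filename or location.
function hashFile(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(path)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

// Group paths by digest; any group with more than one entry
// is a set of exact duplicates.
async function groupByContent(paths) {
  const groups = new Map();
  for (const p of paths) {
    const digest = await hashFile(p);
    if (!groups.has(digest)) groups.set(digest, []);
    groups.get(digest).push(p);
  }
  return [...groups.values()].filter((g) => g.length > 1);
}
```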
### Size-Based
- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- With a size tolerance, finds near-duplicates (similar size, not necessarily identical content)
### Name-Based
- Compares filenames
- Detects similar named files
- Good for finding version duplicates (file_v1, file_v2)
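One common way to score filename similarity is normalized Levenshtein distance; a threshold like the `nameSimilarity: 0.7` in `config.json` would then mean "group names scoring 0.7 or above." This is an illustrative sketch, not necessarily the skill's internal metric:
```javascript
// Classic dynamic-programming edit distance between two strings.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  const d = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                   // deletion
        d[i][j - 1] + 1,                                   // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return d[m][n];
}

// Normalize to 0-1: 1 means identical names, 0 means unrelated.
function nameSimilarity(a, b) {
  const maxLen = Math.max(a.length, b.length) || 1;
  return 1 - levenshtein(a, b) / maxLen;
}

console.log(nameSimilarity('report_v1.pdf', 'report_v2.pdf')); // ≈ 0.92
```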
## Examples
### Find Duplicates in Documents
```javascript
const result = await findDuplicates({
  directories: '~/Documents',
  options: {
    method: 'content',
    includeSubdirs: true
  }
});
console.log(`Found ${result.duplicateCount} duplicate sets`);
result.duplicates.slice(0, 5).forEach((set, i) => {
  console.log(`Set ${i + 1}: ${set.files.length} files`);
  console.log(`  Total size: ${set.totalSize} bytes`);
});
```
### Remove Duplicates, Keep Newest
```javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    keep: 'newest',
    action: 'delete'
  }
});
console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);
```
### Move to Archive Instead of Delete
```javascript
const result = await removeDuplicates({
  directories: '~/Downloads',
  options: {
    keep: 'newest',
    action: 'move',
    archivePath: '~/Documents/Archive'
  }
});
console.log(`Archived ${result.filesRemoved} files`);
console.log(`Safe in: ~/Documents/Archive`);
```
### Dry-Run Preview Changes
```javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    dryRun: true // just show what would happen
  }
});
console.log('=== Dry Run Preview ===');
result.duplicates.forEach((set) => {
  console.log(`Would delete: ${set.toDelete.join(', ')}`);
});
```
## Performance
### Scanning Speed
- **Small directories** (<1000 files): <1s
- **Medium directories** (1000-10000 files): 1-5s
- **Large directories** (10000+ files): 5-20s
### Detection Accuracy
- **Content-based:** exact duplicates; MD5 collisions are theoretically possible but vanishingly rare in practice
- **Size-based:** Fast but may miss renamed files
- **Name-based:** Detects naming patterns only
### Memory Usage
- **Hash cache:** ~1MB per 100,000 files
- **Batch processing:** Processes 1000 files at a time
- **Peak memory:** ~200MB for 1M files
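The batching described above is a standard way to bound memory and open file handles; here is a sketch of the pattern, reusing the `hashFile` helper from the Methods section (the batch size is illustrative, not this skill's internals):
```javascript
// Hash files in fixed-size batches so peak memory and open file
// handles scale with the batch, not with the whole corpus.
async function hashInBatches(paths, batchSize = 1000) {
  const digests = [];
  for (let i = 0; i < paths.length; i += batchSize) {
    const batch = paths.slice(i, i + batchSize);
    digests.push(...(await Promise.all(batch.map(hashFile))));
  }
  return digests;
}
```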
## Safety Features
### Size Thresholding
Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.
### Archive Mode
Move files to archive directory instead of deleting. No data loss, full recoverability.
### Action Logging
All deletions/moves are logged to file for recovery and audit.
### Undo Functionality
Log file can be used to restore accidentally deleted files (limited undo window).
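The log format is not documented here. Purely to illustrate what log-based recovery could look like, the sketch below assumes a JSON-lines log of `{ action, from, to }` records where `to` points at the archived copy (per the backup-to-archive behavior listed under Safety Features); the format and the `undoFromLog` helper are hypothetical:
```javascript
const fs = require('fs');

// Hypothetical log format: one JSON record per line, e.g.
// { "action": "move", "from": "/original/path", "to": "/archive/path" }
function undoFromLog(logPath) {
  const lines = fs.readFileSync(logPath, 'utf8').split('\n').filter(Boolean);
  for (const line of lines.reverse()) {  // undo in reverse order
    const { from, to } = JSON.parse(line);
    fs.renameSync(to, from);             // move the archived copy back
  }
}
```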
## Error Handling
### Permission Errors
- Clear error message
- Suggest running with sudo
- Skip files that can't be accessed
### File Lock Errors
- Detect locked files
- Skip and report
- Suggest closing applications using files
### Space Errors
- Check available disk space before deletion
- Warn if space is critically low
- Prevent disk-full scenarios
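A plausible skip-and-report pattern for the permission and lock errors above, using Node's standard error codes (`EACCES`/`EPERM` for permission denied, `EBUSY` for a file locked by another process); the `tryRemove` helper is illustrative:
```javascript
const fs = require('fs');

// Attempt to delete a file; skip and record known, recoverable errors
// instead of aborting the whole run.
function tryRemove(path, errors) {
  try {
    fs.unlinkSync(path);
    return true;
  } catch (err) {
    if (['EACCES', 'EPERM', 'EBUSY'].includes(err.code)) {
      errors.push({ path, code: err.code }); // skip and report
      return false;
    }
    throw err; // unexpected errors still surface
  }
}
```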
## Troubleshooting
### Not Finding Expected Duplicates
- Check detection method (content vs size vs name)
- Verify exclude patterns aren't too broad
- Check if files are in whitelisted directories
- Confirm `includeSubdirs: true` if the duplicates live in subdirectories
### Deletion Not Working
- Check write permissions on directories
- Verify `dryRun` isn't enabled; with `autoConfirm: false`, each group waits for interactive confirmation
- Check size threshold isn't blocking all deletions
- Check file locks (is another program using files?)
### Slow Scanning
- Narrow the scan: set `includeSubdirs: false` or add `excludePatterns`
- For very large trees, run the faster `size` method as a first pass