File Deduplication System on AWS with SHA-256
Project Overview
The File Deduplication System is a cloud-native solution that optimizes storage efficiency while maintaining seamless file access. Using SHA-256 cryptographic hashing, the system identifies duplicate files and stores only unique instances in AWS S3, while maintaining user-specific references through metadata management.
My Approach
Built a scalable architecture using AWS Lambda for serverless processing, S3 for deduplicated object storage, and DynamoDB for metadata management
Implemented a FastAPI middleware layer to handle file routing, user authentication, and system orchestration
Developed client-side SHA-256 computation to identify duplicates before transmission, reducing bandwidth usage
Integrated AWS KMS server-side encryption to protect data at rest, with TLS securing data in transit
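The client-side fingerprinting step above can be sketched with Python's standard hashlib module. This is a minimal illustration, not the project's actual code; the function name and chunk size are assumptions:

```python
import hashlib

def file_fingerprint(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file in fixed-size chunks,
    so large files never need to be loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on file content, two files with different names or owners but identical bytes produce the same fingerprint, which is what makes the pre-upload duplicate check possible.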
Key Features at a Glance
Zero-Duplicate Storage – Files with identical content are stored only once, regardless of filename or owner
Transparent User Experience – Users interact with their files normally, unaware of backend deduplication
Lightweight SHA-256 Processing – Hash calculation occurs client-side to minimize bandwidth usage
AWS Lambda Integration – Serverless processing eliminates infrastructure management overhead
DynamoDB Metadata Storage – High-performance NoSQL database manages file-user relationships
KMS Encryption – Server-side encryption with KMS-managed keys protects stored data throughout the processing pipeline
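One plausible shape for the DynamoDB records that manage file-user relationships is sketched below. The table layout and attribute names are assumptions for illustration; the key idea is that the content hash doubles as the S3 object key, so identical files map to one stored object:

```python
def build_metadata_item(user_id, filename, sha256_hex, size_bytes):
    """Build a DynamoDB item (low-level AttributeValue format) linking a
    user's filename to a content hash. Attribute names are illustrative."""
    return {
        "user_id": {"S": user_id},             # partition key: the owner
        "filename": {"S": filename},           # sort key: user-visible name
        "content_hash": {"S": sha256_hex},     # SHA-256 digest, used as S3 key
        "size_bytes": {"N": str(size_bytes)},  # DynamoDB encodes numbers as strings
    }
```

With this layout, a user's personalized view is a single-partition query on `user_id`, and the same `content_hash` can appear in many users' items while pointing at one S3 object.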
Architecture and Optimization

The system prioritizes performance with:
Client-side hash calculation to avoid unnecessary uploads
Serverless processing for automatic scaling during high-demand periods
Layered security with request signing and KMS encryption
Metadata-driven access for rapid file retrieval without scanning S3 buckets
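The request-signing layer mentioned above can be illustrated with an HMAC-SHA256 over the canonical parts of a request. This is a simplified sketch of the idea behind schemes like AWS Signature Version 4, not the system's actual signing code:

```python
import hashlib
import hmac

def sign_request(secret_key, method, path, body_sha256):
    """Sign a request by HMAC-ing its canonical parts (method, path, and
    the SHA-256 of the body), so the backend can verify both the caller's
    identity and the payload's integrity. Simplified for illustration."""
    canonical = "\n".join([method, path, body_sha256])
    return hmac.new(secret_key.encode(), canonical.encode(),
                    hashlib.sha256).hexdigest()
```

Any change to the body changes its SHA-256, which changes the signature, so a tampered upload fails verification before it ever reaches the deduplication logic.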
By leveraging AWS's managed services and implementing efficient algorithms, the system becomes more than a file storage solution—it becomes an intelligent storage optimizer.
Flow at a Glance
1️⃣ File Selection → User selects a file for upload
2️⃣ SHA-256 Calculation → Client calculates the file's unique fingerprint
3️⃣ Deduplication Check → System verifies if the file already exists in storage
4️⃣ Conditional Upload → New files are uploaded; duplicates are referenced
5️⃣ Metadata Registration → User-file relationship is recorded in DynamoDB
6️⃣ Seamless Access → Users access their files through a personalized view
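The six steps above can be simulated end to end with an in-memory stand-in for the S3 + DynamoDB pair. The class below is a teaching sketch, not production code; a dict plays the role of each AWS service:

```python
import hashlib

class DedupStore:
    """In-memory model of the dedup flow: fingerprint, duplicate check,
    conditional upload, metadata record, and metadata-driven access."""

    def __init__(self):
        self.objects = {}   # content_hash -> bytes    (stands in for S3)
        self.metadata = {}  # (user, filename) -> hash (stands in for DynamoDB)

    def upload(self, user, filename, data):
        h = hashlib.sha256(data).hexdigest()  # step 2: fingerprint
        is_new = h not in self.objects        # step 3: deduplication check
        if is_new:
            self.objects[h] = data            # step 4: upload only new content
        self.metadata[(user, filename)] = h   # step 5: record user-file link
        return {"hash": h, "uploaded": is_new}

    def download(self, user, filename):
        # step 6: resolve the user's view via metadata, never scanning storage
        return self.objects[self.metadata[(user, filename)]]
```

Uploading the same bytes under two users stores the content once but gives each user their own reference, which is exactly the zero-duplicate, transparent-access behavior described above.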
Conclusion
The File Deduplication System represents a significant advancement in cloud storage efficiency, enabling organizations to eliminate redundant storage consumption while maintaining a seamless user experience and robust security.


