File Deduplication System on AWS with SHA-256
Project Overview
The File Deduplication System is a cloud-native solution that optimizes storage efficiency while maintaining seamless file access. Using SHA-256 cryptographic hashing, the system identifies duplicate files and stores only unique instances in AWS S3, while maintaining user-specific references through metadata management.
My Approach
Built a scalable architecture using AWS Lambda for serverless processing, S3 for deduplicated object storage, and DynamoDB for metadata management
Implemented a FastAPI middleware layer to handle file routing, user authentication, and system orchestration
Developed client-side SHA-256 computation to identify duplicates before transmission, reducing bandwidth usage
Integrated AWS KMS server-side encryption to protect data at rest, with TLS securing data in transit
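The client-side fingerprinting step above can be sketched with Python's standard hashlib module. This is a minimal illustration, not the project's actual code; the function name and chunk size are assumptions:

```python
import hashlib

def file_fingerprint(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file in fixed-size chunks,
    so large files never need to be loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on file content, two files with different names or owners but identical bytes produce the same fingerprint, which is what makes the pre-upload duplicate check possible.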
Key Features at a Glance
Zero-Duplicate Storage – Files with identical content are stored only once, regardless of filename or owner
Transparent User Experience – Users interact with their files normally, unaware of backend deduplication
Lightweight SHA-256 Processing – Hash calculation occurs client-side to minimize bandwidth usage
AWS Lambda Integration – Serverless processing eliminates infrastructure management overhead
DynamoDB Metadata Storage – High-performance NoSQL database manages file-user relationships
KMS Encryption – Server-side encryption with KMS-managed keys protects stored data throughout the processing pipeline
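One plausible shape for the DynamoDB records that manage file-user relationships is sketched below. The table layout and attribute names are assumptions for illustration; the key idea is that the content hash doubles as the S3 object key, so identical files map to one stored object:

```python
def build_metadata_item(user_id, filename, sha256_hex, size_bytes):
    """Build a DynamoDB item (low-level AttributeValue format) linking a
    user's filename to a content hash. Attribute names are illustrative."""
    return {
        "user_id": {"S": user_id},             # partition key: the owner
        "filename": {"S": filename},           # sort key: user-visible name
        "content_hash": {"S": sha256_hex},     # SHA-256 digest, used as S3 key
        "size_bytes": {"N": str(size_bytes)},  # DynamoDB encodes numbers as strings
    }
```

With this layout, a user's personalized view is a single-partition query on `user_id`, and the same `content_hash` can appear in many users' items while pointing at one S3 object.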
Architecture and Optimization

The system prioritizes performance with:
Client-side hash calculation to avoid unnecessary uploads
Serverless processing for automatic scaling during high-demand periods
Layered security with request signing and KMS encryption
Metadata-driven access for rapid file retrieval without scanning S3 buckets
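The request-signing layer mentioned above can be illustrated with an HMAC-SHA256 over the canonical parts of a request. This is a simplified sketch of the idea behind schemes like AWS Signature Version 4, not the system's actual signing code:

```python
import hashlib
import hmac

def sign_request(secret_key, method, path, body_sha256):
    """Sign a request by HMAC-ing its canonical parts (method, path, and
    the SHA-256 of the body), so the backend can verify both the caller's
    identity and the payload's integrity. Simplified for illustration."""
    canonical = "\n".join([method, path, body_sha256])
    return hmac.new(secret_key.encode(), canonical.encode(),
                    hashlib.sha256).hexdigest()
```

Any change to the body changes its SHA-256, which changes the signature, so a tampered upload fails verification before it ever reaches the deduplication logic.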
By leveraging AWS's managed services and implementing efficient algorithms, the system becomes more than a file storage solution—it becomes an intelligent storage optimizer.
Flow at a Glance
1️⃣ File Selection → User selects a file for upload
2️⃣ SHA-256 Calculation → Client calculates the file's unique fingerprint
3️⃣ Deduplication Check → System verifies if the file already exists in storage
4️⃣ Conditional Upload → New files are uploaded; duplicates are referenced
5️⃣ Metadata Registration → User-file relationship is recorded in DynamoDB
6️⃣ Seamless Access → Users access their files through a personalized view
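The six steps above can be simulated end to end with an in-memory stand-in for the S3 + DynamoDB pair. The class below is a teaching sketch, not production code; a dict plays the role of each AWS service:

```python
import hashlib

class DedupStore:
    """In-memory model of the dedup flow: fingerprint, duplicate check,
    conditional upload, metadata record, and metadata-driven access."""

    def __init__(self):
        self.objects = {}   # content_hash -> bytes    (stands in for S3)
        self.metadata = {}  # (user, filename) -> hash (stands in for DynamoDB)

    def upload(self, user, filename, data):
        h = hashlib.sha256(data).hexdigest()  # step 2: fingerprint
        is_new = h not in self.objects        # step 3: deduplication check
        if is_new:
            self.objects[h] = data            # step 4: upload only new content
        self.metadata[(user, filename)] = h   # step 5: record user-file link
        return {"hash": h, "uploaded": is_new}

    def download(self, user, filename):
        # step 6: resolve the user's view via metadata, never scanning storage
        return self.objects[self.metadata[(user, filename)]]
```

Uploading the same bytes under two users stores the content once but gives each user their own reference, which is exactly the zero-duplicate, transparent-access behavior described above.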
Conclusion
The File Deduplication System represents a significant advancement in cloud storage efficiency, enabling organizations to eliminate redundant storage consumption while maintaining a seamless user experience and robust security.


