DeDuper: File Deduplication using SHA-256 and AWS Lambda

Cloud Computing Course Work
1 month
AWS Project

File Deduplication System on AWS with SHA-256

Project Overview

The File Deduplication System is a cloud-native solution that optimizes storage efficiency while maintaining seamless file access. Using SHA-256 cryptographic hashing, the system identifies duplicate files and stores only unique instances in AWS S3, while maintaining user-specific references through metadata management.

My Approach

  • Built a scalable architecture utilizing AWS Lambda for serverless processing, S3 for optimized storage, and DynamoDB for metadata management

  • Implemented a FastAPI middleware layer to handle file routing, user authentication, and system orchestration

  • Developed client-side SHA-256 computation to identify duplicates before transmission, reducing bandwidth usage

  • Integrated AWS KMS encryption to ensure data security at rest and in transit

Key Features at a Glance

  1. Zero-Duplicate Storage – Files with identical content are stored only once, regardless of filename or owner

  2. Transparent User Experience – Users interact with their files normally, unaware of backend deduplication

  3. Lightweight SHA-256 Processing – Hash calculation occurs client-side to minimize bandwidth usage

  4. AWS Lambda Integration – Serverless processing eliminates infrastructure management overhead

  5. DynamoDB Metadata Storage – High-performance NoSQL database manages file-user relationships

  6. KMS Encryption – End-to-end encryption ensures data security throughout the processing pipeline

Architecture and Optimization

The system prioritizes performance with:

  1. Client-side hash calculation to avoid unnecessary uploads

  2. Serverless processing for automatic scaling during high-demand periods

  3. Layered security with request signing and KMS encryption

  4. Metadata-driven access for rapid file retrieval without scanning S3 buckets

By combining AWS managed services with efficient hashing, the system is more than a file store: it is an intelligent storage optimizer.
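The "metadata-driven access" point can be illustrated with a small sketch in which plain dictionaries stand in for the DynamoDB tables. The table layout and names here are assumptions for illustration, not the project's actual schema; the point is that retrieval is two key lookups, never a bucket scan.

```python
# In-memory stand-ins for the DynamoDB metadata tables (assumed layout).
user_files = {}   # (user_id, filename) -> SHA-256 hex digest
blob_index = {}   # SHA-256 hex digest  -> S3 object key

def register_file(user_id, filename, digest):
    """Record a user-file relationship; the underlying blob is stored once."""
    blob_index.setdefault(digest, f"blobs/{digest}")
    user_files[(user_id, filename)] = digest

def resolve(user_id, filename):
    """Resolve a user's filename directly to an object key:
    two hash-map lookups, no scan of the storage bucket."""
    digest = user_files[(user_id, filename)]
    return blob_index[digest]
```

Two users who upload byte-identical files end up with separate metadata entries that resolve to the same object key, which is exactly what makes deduplication invisible to them.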

Flow at a Glance

1️⃣ File Selection → User selects a file for upload
2️⃣ SHA-256 Calculation → Client calculates the file's unique fingerprint
3️⃣ Deduplication Check → System verifies if the file already exists in storage
4️⃣ Conditional Upload → New files are uploaded; duplicates are referenced
5️⃣ Metadata Registration → User-file relationship is recorded in DynamoDB
6️⃣ Seamless Access → Users access their files through a personalized view
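The six steps above can be simulated end to end in a few lines. Dictionaries stand in for S3 and DynamoDB, and the function names are illustrative, not taken from the project.

```python
import hashlib

storage = {}    # digest -> file bytes      (stands in for S3)
metadata = {}   # (user, name) -> digest    (stands in for DynamoDB)

def upload(user, name, data):
    """Steps 2-5: fingerprint, dedupe check, conditional upload, metadata."""
    digest = hashlib.sha256(data).hexdigest()   # 2: compute fingerprint
    if digest not in storage:                   # 3: does it already exist?
        storage[digest] = data                  # 4: upload only if new
    metadata[(user, name)] = digest             # 5: record user-file link
    return digest

def download(user, name):
    """Step 6: transparent access through the user's own view."""
    return storage[metadata[(user, name)]]
```

Uploading the same bytes under two different users and filenames stores the content once while both users retrieve it normally.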

Conclusion

The File Deduplication System shows how content-addressed storage on AWS can eliminate redundant data, letting organizations maximize storage utilization while preserving a seamless user experience and robust security.
