Overview
- Client: Mid-sized law firm based in California
- Project: Production
- Data Set: Videos, Images, and PDF versions of certain images
- Data Size: 1.19 GB (868 Documents)
- Technology: Reveal, Relativity, and Warp9 workflows
- Turnaround Time: ~26 hours
Key Results
- Document reduction: 868 → 593 (~32%)
- Cross-format deduplication achieved (PDF vs native images)
- All duplicate pairs validated through manual review
- Production delivered within ~26 hours of request
Background
The client is a mid-sized, full-service law firm based in California, serving businesses, organizations, and individuals across a broad range of legal matters. For this request, the client required the production of searchable PDF files, with corresponding native files provided separately when applicable. The source data consisted of videos, images, and PDF versions of certain images.
The Challenge
The central instruction, and the main challenge, concerned deduplication. The client required that only one version of each image be retained for production. However, the dataset contained both PDF and native versions of the same images, raising the question of whether duplicates could be accurately identified and removed across different file formats.
Compounding the challenge, the team’s primary software could not deduplicate across file types (i.e., it could not compare native image files against PDFs). The team therefore needed a workflow that went beyond the platform’s built-in deduplication to meet the client’s expectations.
Approach
Where platform limitations restricted cross-format comparison, the workflow was extended through external tools and expert-led validation to ensure complete and accurate results.
To address the issue, the team implemented a two-pronged deduplication strategy:
- File Name Matching – Random quality checks showed that documents where the PDF and the image shared identical file names were exact matches. An external application also flagged these pairs as duplicates. The team proposed retaining only the PDF versions and excluding the native image counterparts, subject to client confirmation.
- External Application Analysis (Non-Matching File Names) – For files where the PDF and image names differed, the team leveraged an external application to identify potential duplicates. Each flagged pair was then manually reviewed and visually confirmed to ensure accuracy.
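The split between the two prongs can be illustrated with a minimal sketch. It assumes that "identical file names" means the same base name with a different extension, and the function name `split_by_name_match` and the list of image extensions are hypothetical; the case study does not describe the actual tooling used for this step.

```python
from collections import defaultdict
from pathlib import PurePath

# Assumption: the case study does not list which extensions count as
# "native images"; this set is illustrative only.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".gif"}


def split_by_name_match(file_names):
    """Separate PDF/image pairs that share a base name from everything else.

    Returns (name_matched, needs_external_check):
      name_matched         -> list of (pdf_name, image_name) candidate pairs
      needs_external_check -> PDFs and images with no same-named counterpart
    """
    groups = defaultdict(lambda: {"pdf": [], "image": []})
    for name in file_names:
        path = PurePath(name)
        ext = path.suffix.lower()
        if ext == ".pdf":
            groups[path.stem.lower()]["pdf"].append(name)
        elif ext in IMAGE_EXTS:
            groups[path.stem.lower()]["image"].append(name)
        # Videos and other file types are outside the PDF-vs-image comparison.

    name_matched, needs_external_check = [], []
    for group in groups.values():
        pdfs, images = group["pdf"], group["image"]
        if pdfs and images:
            # Pair each image with the first same-named PDF; any extra PDFs
            # are left for the second prong.
            name_matched.extend((pdfs[0], image) for image in images)
            needs_external_check.extend(pdfs[1:])
        else:
            needs_external_check.extend(pdfs + images)
    return name_matched, needs_external_check


if __name__ == "__main__":
    files = ["IMG_001.jpg", "IMG_001.pdf", "scan_A.pdf", "photo_17.png", "clip.mp4"]
    matched, remaining = split_by_name_match(files)
    print("Name-matched pairs:", matched)
    print("Send to external application:", remaining)
```

In this sketch, a name match only nominates a pair as a candidate. As described above, the pairs were still spot-checked and the exclusion of native images was subject to client confirmation.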
Similarity Threshold Review
- ~98% similarity: Pairs differed only in minor quality or file size details
- ~90–96% similarity: Pairs showed slight variations in image coverage or alignment
All duplicate determinations were validated through structured review, so that the process remains defensible if questioned.
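As a rough illustration of how the similarity bands above might be used to organize manual review, the sketch below labels each flagged pair with what a reviewer would likely need to check. The function name `review_band`, the labels, and the exact cut-off of 97 are assumptions; the case study reports only approximate bands (~98% and ~90–96%) and does not say how the external application computes its scores.

```python
def review_band(similarity):
    """Map a similarity score (0-100) to a hypothetical review label.

    The cut-offs are assumed for illustration; they only approximate the
    bands reported in the case study.
    """
    if not 0 <= similarity <= 100:
        raise ValueError("similarity must be between 0 and 100")
    if similarity >= 97:
        # Roughly the "~98%" band: minor quality or file size differences.
        return "quality_or_size"
    if similarity >= 90:
        # Roughly the "~90-96%" band: coverage or alignment variations.
        return "coverage_or_alignment"
    # The case study gives no guidance for scores outside the reported bands.
    return "unclassified"


if __name__ == "__main__":
    # Hypothetical flagged pairs: (pdf_name, image_name, similarity score).
    flagged = [
        ("exhibit_A.pdf", "DSC_0042.jpg", 98.1),
        ("exhibit_B.pdf", "DSC_0107.jpg", 92.4),
    ]
    for pdf_name, image_name, score in flagged:
        print(pdf_name, image_name, review_band(score))
```

The label is only a hint for the reviewer. Consistent with the process described above, every flagged pair would still be visually confirmed before being treated as a duplicate.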
Result
The dataset was reduced from 868 to 593 documents, eliminating duplicate content across formats while preserving all required materials.
All files were reviewed, validated, and produced within approximately 26 hours from request to delivery. This was achieved despite the platform’s inability to perform cross-format deduplication natively.
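The reported reduction can be checked with a short calculation. The helper name `reduction_summary` is hypothetical; the document counts are taken from this case study.

```python
def reduction_summary(before, after):
    """Return (documents removed, percentage reduction rounded to 0.1)."""
    if before <= 0:
        raise ValueError("before must be a positive document count")
    removed = before - after
    return removed, round(100 * removed / before, 1)


if __name__ == "__main__":
    removed, percent = reduction_summary(868, 593)
    print(f"{removed} documents removed ({percent}% reduction)")
```

With the figures above, 275 documents were removed, which is about 31.7% of the original set and consistent with the ~32% stated in the key results.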
Key Takeaways
Proactive Communication. Clients are kept informed early about any challenges or system limitations, together with clear plans on how these will be addressed, ensuring transparency throughout the project.
Solution-Oriented Approach. The team does not rely solely on standard tools. When limitations arise, they actively develop alternative workflows and leverage additional resources to meet project requirements.
Commitment to Accuracy and Quality. Through a combination of technology and manual validation, the team ensures that outputs are precise, reliable, and defensible.
Flexibility and Adaptability. The team can adjust strategies based on the dataset’s complexity and client needs.
Efficient Turnaround Without Compromising Quality. Even when faced with technical constraints, the team delivers results promptly while maintaining high standards.
Attention to Detail. Documents are carefully reviewed wherever needed, ensuring that even near-duplicates and subtle differences are properly evaluated.
Client-Centric Decision Making. Recommendations are always aligned with the client’s goals and confirmed before implementation.
Summary
This project demonstrates how cross-format deduplication can be executed effectively even when core systems do not support it natively. By extending workflows beyond platform limitations and combining automated detection with expert validation, the dataset was streamlined and delivered within a compressed timeline without compromising accuracy, completeness, or defensibility.
Authors: Paulo Santos and Jefferson Abada
