20 Commonly Used eDiscovery Processing Terms Explained
In data processing and eDiscovery workflows, various buzzwords and terms are commonly tossed around in normal conversation. For those unfamiliar with data handling and processing workflows, these can be confusing. To help, here are 20 common terms that eDiscovery professionals use in relation to the processing stage of the EDRM and the overall electronic discovery process.
Term | Explanation |
Data Ingestion | Data Ingestion, sometimes referred to as Data “In”, is the process of loading data from an original or post-collection source into a storage medium or technology where it can be assessed, analyzed, and used for data processing. |
Data Out | Data Out, also known as data export, is the exporting of the post-processed and filtered datasets for use within review platforms or further data analysis and reporting. |
Culling | Culling is the act of reducing and controlling the size of a data set through the application of various data reduction and filtering workflows such as de-duplication and de-NISTing. |
De-NIST | De-NISTing is the standard eDiscovery practice of removing system files and other non-user-generated ESI from a dataset during the data processing stage. NIST is the acronym for the National Institute of Standards and Technology, which maintains the National Software Reference Library (NSRL), a list of hash values for known software files; files whose hashes match that list are removed. The purpose of De-NISTing is to remove ESI that is unlikely to contain relevant information, which in turn reduces the size of the dataset and the overall cost of hosting and reviewing it. |
File Type Filter | File Type Filtering is the workflow of filtering out, or filtering to find, specific file types such as .PDF, .DOC, .PPT, .TXT, .HTML, .HTM and more. There is an extensive number of file types and variations of file types, with new ones being introduced as new software and technology are created and released. |
Date Range Filter | Date Range Filtering is the process of limiting datasets to a specific period of time. This is commonly done by specifying a start date and an end date, but a filter can also be open-ended, keeping only items created before a specified date or only items created after a specified date. |
De-Duplication | De-Duplication is the process of removing identical and duplicative ESI from a dataset. Identical ESI can be stored in multiple places at once. This is commonly seen in emails, where one email can be stored in multiple folders. It also occurs when one file is stored in multiple places, such as on the desktop, within the native application, and in cloud-hosted storage. |
Native File Processing | Native File Processing is when a file is processed in its native (original) format. This means that the file is not converted during or for the purpose of data processing. For example, if a PowerPoint file is processed, the data will remain in .PPT format and not be converted to a new file type for processing. The types of files available for native file processing depend on the technology. |
Batch Processing | The processing of a large amount of ESI in a single step. [1] |
Load File | A file that relates to a set of scanned images or electronically processed files, and indicates where individual pages or files belong together as documents, to include attachments, and where each document begins and ends. [2] |
Filtering | Filtering is the process of reducing a data set through the application of various filter criteria such as file type, date range, and search term. |
Hash Values | Hash values are alphanumeric values computed from a file's contents by a hashing algorithm such as MD5 or SHA-1, and they serve as identifiers for specific documents. They are not random: the same content always produces the same hash value, and different content is extremely unlikely to produce the same one. Hash values are commonly referred to as the “digital fingerprint” of data. Computers use hash values to compare ESI against each other and to determine whether documents are duplicates. |
Metadata | Metadata is descriptive information about a document, covering details such as the time of creation, the author, and the title. A common saying is that metadata is data about data. |
OCR | OCR stands for Optical Character Recognition, the process by which software analyzes an image of a document, identifies the text and characters within it, and produces searchable text. OCR is commonly used to convert scanned images of physical pages into searchable electronic text. |
Structured Data | Electronically stored information that is maintained in a structured format such as a database. |
Unstructured Data | Electronically stored information that resides in various unstructured formats, such as free form text, social media activity, and rich media. |
ESI | ESI stands for Electronically Stored Information. This is the umbrella term that encompasses any and all data that is stored electronically. |
Byte | A unit of information stored in a computer, equal to 8 bits. A computer’s memory is measured in bytes. [3] |
PST | A PST (Personal Storage Table) file is a file generated by an email application, most commonly Microsoft Outlook, that contains the messages, calendars, contacts, tasks, and other information relating to an email account, stored locally on the user's device. |
TIFF | TIFF stands for Tagged Image File Format. A TIFF is essentially an image of how a document looks. TIFF is a standardized file format that can be opened on nearly all devices. TIFF files are commonly stored uncompressed or with lossless compression, so image quality is preserved. |
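To make the culling terms above more concrete, here is a minimal Python sketch of how De-NISTing, File Type Filtering, Date Range Filtering, and De-Duplication can be chained together using hash values. It is an illustration only, not how any particular processing platform works. The `Item` class, the `cull` function, the choice of SHA-256, and the sample data are all assumptions made for this example; real tools compare against the actual NIST reference list and often use MD5 or SHA-1.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for one piece of ESI."""
    path: str
    content: bytes
    created: date


def hash_value(content: bytes) -> str:
    """Compute the 'digital fingerprint' of the content (SHA-256 is assumed here)."""
    return hashlib.sha256(content).hexdigest()


def cull(
    items: Iterable[Item],
    known_system_hashes: Set[str],
    allowed_extensions: Set[str],
    start: date,
    end: date,
) -> List[Item]:
    """Apply De-NIST, file type, date range, and de-duplication steps in order."""
    seen_hashes: Set[str] = set()
    kept: List[Item] = []
    for item in items:
        digest = hash_value(item.content)
        # De-NIST: drop files whose hash matches a known system file.
        if digest in known_system_hashes:
            continue
        # File type filter: keep only the listed extensions.
        if PurePosixPath(item.path).suffix.lower() not in allowed_extensions:
            continue
        # Date range filter: keep items created within [start, end].
        if not (start <= item.created <= end):
            continue
        # De-duplication: keep only the first item seen with a given hash.
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(item)
    return kept


# Made-up sample data for illustration.
sample = [
    Item("a/report.doc", b"quarterly report", date(2020, 3, 1)),
    Item("b/report_copy.doc", b"quarterly report", date(2020, 3, 2)),  # duplicate
    Item("c/kernel32.dll", b"system bytes", date(2020, 3, 1)),         # system file
    Item("d/notes.txt", b"old notes", date(2015, 1, 1)),               # out of range
    Item("e/photo.png", b"image", date(2020, 5, 1)),                   # wrong type
]
kept = cull(
    sample,
    known_system_hashes={hash_value(b"system bytes")},
    allowed_extensions={".doc", ".txt", ".pdf"},
    start=date(2020, 1, 1),
    end=date(2020, 12, 31),
)
print([item.path for item in kept])  # prints ['a/report.doc']
```

In this sketch, de-duplication runs last so that a duplicate is only compared against items that already passed the other filters. Actual platforms may order these steps differently.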
Sources:
[1] Batch Processing. 2010. In The Sedona Conference® Glossary: E-Discovery & Digital Information Management (Third Edition). Retrieved from https://canons.sog.unc.edu/wp-content/uploads/2011/02/glossary2010.pdf
[2] Load File. 2010. In The Sedona Conference® Glossary: E-Discovery & Digital Information Management (Third Edition). Retrieved from https://canons.sog.unc.edu/wp-content/uploads/2011/02/glossary2010.pdf
[3] Byte. In Oxford Advanced Learner’s Dictionary at OxfordLearnersDictionaries.com. (n.d.) Retrieved from https://www.oxfordlearnersdictionaries.com/us/definition/english/byte