The purpose of predictive coding is to assist users with the coding and categorization of a large number of documents, thus speeding up the review process. Traditionally, a senior reviewer would review hundreds, if not thousands, of documents and code each one individually. Predictive coding provides a workflow in which the senior reviewer reviews only a portion of the documents, thus “teaching” the system how to code the others. The system takes the documents coded by the user and applies similar coding to documents with similar content.
Attached at the bottom of this article is a document containing additional information regarding AccessData's approach to Technology Assisted Review (TAR).
The predictive coding workflow has three basic phases: teach the system, apply the system's learning, and perform quality control. Reports generated during the predictive coding process provide information about how the system performed.
Teach the system
The process of teaching the system begins with the reviewer defining a set of documents. Normally the reviewer would narrow the scope of documents to be reviewed by filtering on metadata and/or keywords. The reviewer evaluates each of these documents individually and determines responsiveness to the pre-determined subject matter. To assist in system learning, the reviewer should also provide additional keywords encountered while reviewing the document that supported the reviewer’s decision to code the document either responsive or non-responsive to the subject matter (keywords are currently not integrated with the main algorithm). This process leverages the existing Summation product’s coding and tagging process using a pre-defined “Predictive Coding” layout.
After reviewing a subset of documents, the reviewer may test the system to obtain a confidence level of how well the system has learned. The confidence level is determined by using cross-validation. The confidence level helps the reviewer determine when the system learning is good enough to be applied to the rest of the document set. The confidence level is generated and displayed using the “Predictive Coding Confidence” layout panel.
Note: You cannot bulk code the ReviewResponsiveness column. The industry-accepted principle behind predictive coding is that the entire seed set must be manually reviewed, so the ability to code in bulk is not available for this column.
Apply the system learning
The reviewer determines the set of documents for the system to code. This can be done by applying filters to the document list. The predictive coding process can be initiated on all, checked, or unchecked items in the document list. Documents in this set that were used to train the system will not be re-coded by the system. Once the predictive coding process is complete, a report is generated with information about how many documents were in the learning set, how many documents were coded by the system, how many documents were coded responsive, how many documents were coded non-responsive, and the confidence level at the start of the predictive coding process.
The Iterative Dichotomiser 3 (ID3) algorithm is used to generate the decision tree that the system uses to perform predictive coding. The input attributes for each document are generated by the Cluster Analysis (Sally) feature extraction performed during evidence processing. The list of input attributes can be refined further by changing the configuration file to apply a feature selection algorithm (most frequent or information gain) and a size limit. The default is to use all attributes of the documents. The maximum tree depth is another algorithm setting that can be changed.
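The information gain measure mentioned above is also the criterion ID3 uses to choose which attribute to split on at each node of the tree. The following is a minimal sketch of that calculation, not the product's implementation. The attribute names ("contract", "lunch"), the sample rows, and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Reduction in label entropy obtained by splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain (the ID3 split choice)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical training data: each row records whether a term appears.
rows = [
    {"contract": True, "lunch": False},
    {"contract": True, "lunch": True},
    {"contract": False, "lunch": True},
    {"contract": False, "lunch": False},
]
labels = ["responsive", "responsive", "non-responsive", "non-responsive"]

split_on = best_attribute(rows, labels, ["contract", "lunch"])
```

In this toy data, "contract" separates the two classes perfectly while "lunch" tells the tree nothing, so the split is made on "contract". A full ID3 implementation would repeat this choice recursively on each resulting group until the groups are pure or the maximum tree depth is reached.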
Cross-validation is the process used to determine the confidence level of the system. In this process, the original sample is randomly partitioned into k subsamples (validation folds). Of the k subsamples, n subsamples are retained as the validation data for testing the model, and the remaining k minus n subsamples are used as training data. The validation process is then repeated k times (the folds), with a different set of n subsamples used as the validation data each time. The results from the folds are then averaged to produce a single estimate. The default configuration sets k to 10 and n to 9.
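The partition-and-rotate procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the product's code: the function names are hypothetical, and the way each round picks its n validation subsamples (a rotating window over the folds) is an assumption, since the article does not say how the sets are chosen. The example uses small values of k and n for readability.

```python
import random


def make_folds(items, k, seed=0):
    """Randomly partition items into k folds of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validate(items, k, n, score_fn, seed=0):
    """Average score_fn(training, validation) over k rounds.

    Each round holds out n of the k folds as validation data and
    trains on the remaining k - n folds.
    """
    if not 0 < n < k:
        raise ValueError("n must satisfy 0 < n < k")
    folds = make_folds(items, k, seed)
    scores = []
    for round_index in range(k):
        # Assumption: the held-out folds are a window that rotates each round.
        held_out = {(round_index + j) % k for j in range(n)}
        validation = [x for i in sorted(held_out) for x in folds[i]]
        training = [x for i in range(k) if i not in held_out for x in folds[i]]
        scores.append(score_fn(training, validation))
    return sum(scores) / len(scores)


# Hypothetical scorer that only reports the share of data held out.
def held_out_share(training, validation):
    return len(validation) / (len(training) + len(validation))


average = cross_validate(list(range(20)), k=5, n=1, score_fn=held_out_share)
```

In practice, score_fn would train a decision tree on the training documents, code the validation documents with it, and return the resulting confidence score for that round.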
The confidence level, or confidence score, is defined using the statistical measure known as the F1 score. It is calculated from the precision rate (true positive count over the total number of documents labeled positive) and the recall rate (true positive count over the total number of actual positives). The F1 score is the harmonic mean of these two rates.
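As a worked illustration of the definition above, the sketch below computes precision, recall, and F1 from a list of system-assigned codes and a list of reviewer-assigned codes. The function name and the sample data are hypothetical, and "responsive" is treated as the positive class.

```python
def precision_recall_f1(predicted, actual, positive="responsive"):
    """Return (precision, recall, F1) for the given positive class."""
    true_positive = sum(
        1 for p, a in zip(predicted, actual) if p == positive and a == positive
    )
    labeled_positive = sum(1 for p in predicted if p == positive)
    actual_positive = sum(1 for a in actual if a == positive)

    # Guard against division by zero when a class is absent.
    precision = true_positive / labeled_positive if labeled_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical codes for five documents.
system_codes = ["responsive", "responsive", "responsive",
                "non-responsive", "non-responsive"]
reviewer_codes = ["responsive", "responsive", "non-responsive",
                  "responsive", "non-responsive"]

precision, recall, f1 = precision_recall_f1(system_codes, reviewer_codes)
```

Here the system labeled three documents responsive and two of those were correct, so precision is 2/3. Three documents were actually responsive and the system found two of them, so recall is also 2/3, which gives an F1 score of 2/3.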
The initial implementation in Summation Pro/eDiscovery only allows for one “code” to be applied, "responsive" or "non-responsive," in a project.
Please refer to the Summation Pro/eDiscovery user guide and reviewer guide for further information about predictive coding.