April 16, 2018

Predictive Coding and Its Impact on E-Discovery

Posted By
aureus tech systems

For those working in the legal profession, electronic discovery (e-discovery) is a large part of the job. It includes the following stages:

  • Identification: Identifying which documents are potentially responsive (relevant to a case).
  • Preservation: Making sure these documents are preserved and not destroyed.
  • Collection: Transferring the documents to the attorney's location (a computer, for example).
  • Processing: Making the documents available to review, print, and so on.
  • Review: Reviewing the documents to confirm they are valid and necessary for court.

Gathering evidence through the discovery process is necessary to formulate a credible argument in court. The more evidence and information attorneys have, the better prepared they are to argue their case.

Extracting relevant information to use as evidence in court has become simpler in the electronic age. For one thing, electronic evidence is hard to destroy. Once it's in an electronic format and reaches a network, it's almost impossible to eliminate. You've probably heard news stories about criminals trying to erase computer or cell phone data, only for officials to retrieve it anyway.

However, because of the vast amount of data now in circulation, many attorneys use predictive coding software to aid in the e-discovery process. What impact does predictive coding have on e-discovery? First, let's take a closer look at what predictive coding involves and how it works.

Predictive coding and machine learning

In simple terms, predictive coding relies on machine learning techniques that programmers have been using in software for years. Machine learning means, in a sense, programming the computer to decide what to do in the future based on what it has seen in the past. Programmers supply keywords or phrases, along with the desired response, and these guide how the software responds to future keywords and phrases. The learning doesn't stop with the initial program, however. As the user interacts with it, the software continues to adjust its behavior.

An excellent example of machine learning in action is how spam filtering works in email. Programmers write code that tells the computer what emails "look like" when they are spam. They do this by including keywords and phrases that indicate a particular email is spam. The action the software takes is to put the spam in a special folder instead of your inbox. Every time an email that IS spam reaches your inbox and you flag it as such, the software "learns" more. It adds the keywords or sender to its already extensive list.
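
To make the spam example concrete, here is a minimal sketch in Python. It is only an illustration of the idea described above, not how any real mail provider works: the class name `KeywordSpamFilter`, its methods, and the sample addresses are all made up for this post, and real filters use statistical models rather than a plain keyword list.

```python
class KeywordSpamFilter:
    """Toy keyword-based spam filter (hypothetical, for illustration only)."""

    def __init__(self, spam_keywords):
        # Start from a hand-written keyword list supplied by the programmer.
        self.spam_keywords = {k.lower() for k in spam_keywords}
        self.blocked_senders = set()

    def is_spam(self, sender, body):
        # An email counts as spam if the sender was flagged before
        # or the body contains any known spam keyword.
        words = set(body.lower().split())
        return sender.lower() in self.blocked_senders or bool(words & self.spam_keywords)

    def flag_as_spam(self, sender, new_keywords=()):
        # User feedback: remember the sender and any new keywords.
        self.blocked_senders.add(sender.lower())
        self.spam_keywords.update(k.lower() for k in new_keywords)


spam_filter = KeywordSpamFilter(["lottery", "winner"])

# This email slips through because "pills" is not on the list yet.
first = spam_filter.is_spam("promo@example.com", "Cheap pills available now")

# The user flags it, and the filter "learns" the sender and the keyword.
spam_filter.flag_as_spam("promo@example.com", new_keywords=["pills"])

# A similar email from a different sender is now caught.
second = spam_filter.is_spam("other@example.com", "Discount pills today")
```

The point of the sketch is the feedback loop: the filter's behavior after the flag differs from its behavior before it, without anyone rewriting the program.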

Now, where does predictive coding come into play? Predictive coding is a form of machine learning used specifically for e-discovery. The technical definition of predictive coding is as follows: "Predictive coding is the use of keyword search, filtering and sampling to automate portions of an e-discovery document review."

How does predictive coding work?

Predictive coding uses the same machine learning techniques just mentioned but applies them a bit differently and more extensively. People enter a list of relevant keywords and phrases, which is then used to classify future data as responsive or unresponsive. How does this work in e-discovery? The short answer is that the software performs searches based on the parameters and keywords a person inputs and then applies what it has learned to all future data in the search process. In this way, it extracts evidentiary information relevant to the case. Here are the particular steps in predictive coding.

Defining the seed set

A seed set is the sample set, or training set. It's your starter set. The seed set may consist of a "judgmental sample or a random sample." A judgmental sample is defined by what the person creating it deems important or relevant, so in this case the discovery of evidence is driven by what the lawyer deems relevant. A random seed set pulls a cross-section of data from the larger population of data. Both methods are allowable, although in Rio Tinto Plc v. Vale S.A. the judge advised transparency for those who use predictive coding as a means for e-discovery.
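
The difference between the two kinds of seed set can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names `random_seed_set` and `judgmental_seed_set`, the keyword-matching rule, and the sample documents are all hypothetical, and a real review platform would draw from a far larger collection.

```python
import random


def random_seed_set(documents, size, rng_seed=0):
    # Random sample: a cross-section of the whole collection.
    # A fixed rng_seed keeps the draw reproducible, which helps with transparency.
    rng = random.Random(rng_seed)
    return rng.sample(documents, size)


def judgmental_seed_set(documents, keywords):
    # Judgmental sample: documents the reviewer considers relevant,
    # approximated here by a simple keyword match (an assumption of this sketch).
    lowered = [k.lower() for k in keywords]
    return [d for d in documents if any(k in d.lower() for k in lowered)]


documents = [
    "Invoice for the Q3 shipment to the warehouse",
    "Lunch plans for Friday",
    "Contract amendment regarding shipment delays",
    "Office party reminder",
    "Payment dispute over the delayed shipment",
    "Parking garage closed next week",
]

random_set = random_seed_set(documents, size=3)
judgmental_set = judgmental_seed_set(documents, ["shipment", "contract"])
```

Here the judgmental set contains only the three shipment-related documents, while the random set may include unrelated ones. That contrast is exactly why the choice of method shapes what the software later learns.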

Labeling the set

After the seed set is pulled, each document in it is labeled as either responsive or unresponsive. This simply means it either fits what you are looking for or it does not. The software needs examples of what is considered relevant, and the labels prepare it for the next step.

Creating the formula for future sets

Now the formula can be created as the software analyzes how the seed set was labeled. It reads which text is responsive and which is unresponsive, and it develops its internal algorithm from those labels. That way, when it comes across something that matches what was defined as relevant, it will pull it out as a responsive document, one that the attorney wants.
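
As a rough illustration of how a "formula" can be derived from a labeled seed set, the sketch below trains a tiny naive Bayes text classifier in plain Python. Naive Bayes is our choice for the example, not necessarily what any particular predictive coding product uses, and the function names `train` and `predict` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs from the labeled seed set."""
    word_counts = {"responsive": Counter(), "unresponsive": Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["responsive"]) | set(word_counts["unresponsive"])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        if model["doc_counts"][label] == 0:
            continue  # no examples of this label in the seed set
        # Log prior plus log likelihood of each word, with add-one smoothing.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


seed_set = [
    ("contract amendment regarding shipment delays", "responsive"),
    ("payment dispute over the delayed shipment", "responsive"),
    ("lunch plans for friday", "unresponsive"),
    ("office party reminder for friday", "unresponsive"),
]

model = train(seed_set)
prediction_a = predict(model, "dispute about the shipment contract")
prediction_b = predict(model, "friday lunch reminder")
```

With this seed set, the first new document is classified as responsive and the second as unresponsive, because their words overlap with the words seen under each label. The model knows nothing beyond what the labels taught it, which is why the quality of the seed set matters so much.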

Test and repeat

The final step involves testing your results to ensure they are what you expected. If you are getting back irrelevant information, further refinement is needed. As you flag the incorrect results, the software gets smarter: machine learning kicks in and the results become more accurate. Over repeated rounds, your searches should increasingly return only data relevant to your case.
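
One common way to test results is to compare the software's labels against a sample that a human has reviewed, and then measure precision (how much of what was returned is actually relevant) and recall (how much of the relevant material was found). The sketch below shows that check under our own assumptions: the function name `precision_recall` and the sample labels are invented for illustration.

```python
def precision_recall(predicted, actual, positive="responsive"):
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    false_neg = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Labels from the software versus labels from a human reviewer (made-up data).
predicted = ["responsive", "responsive", "unresponsive", "responsive", "unresponsive"]
actual = ["responsive", "unresponsive", "unresponsive", "responsive", "responsive"]

precision, recall = precision_recall(predicted, actual)

# Documents where the software and the reviewer disagree. These are the ones
# to correct and feed back into the training set before the next round.
needs_review = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
```

In this example both precision and recall come out to two thirds, and documents 1 and 4 are the ones to correct. Repeating the train, test, and correct cycle is what "test and repeat" amounts to in practice.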

Is predictive coding accepted as reliable?

Most people who use predictive coding consider it a reliable means of discovery. In the 2012 case Da Silva Moore v. Publicis Groupe, Judge Andrew Peck recognized the validity of predictive coding as a means of discovery. There are, however, disagreements over how to go about defining the seed set. Some argue that transparency is necessary for the entire process, which means attorneys would explain how they defined their seed set, that is, what parameters they worked within to gather the data. Others argue that doing this "fails to recognize that a seed set may reflect a lawyer's perceptions of relevance, litigation tactics, or even its trial strategy."