Friday, September 21, 2012

Information Classification

Information classification is the process of grouping information by shared characteristics.  It is a lot harder than meets the eye.  When it is the employees classifying data (increasingly a moot point, as there is far too much information for them to handle), it is more art than science and more contextual guesstimating than measuring.  It is harder still when a user must decide what is a formal record (one required for legal or regulatory reasons) and what is not.  And there is a really good explanation for why that is the reality.

Today, around the globe, employees do business, real business, on Facebook, Twitter, blogs, SharePoint, text messages and email.  Because email has been the business tool of choice for many years, and billions of emails are used in business every day, it is a good place to start exploring why 100% exactitude in classification is not realistic.

Let’s delve into an example to start to understand just how complicated the mere act of classifying information can be. 

Lily, the manager of the sales support unit, gets the following email from Teddy, the head of the leasing business unit.  You make the call: is it a record, and if so, how should it be classified?

“Thanks for overseeing the Ace Leasing deal.  I thought your assistant manager, Dylan did a good job and I think he is ready for bigger challenges and a boost in pay.  It would have been useful if he brought contracting in sooner.  We should really think about how to make the documentation process touch fewer hands and simpler over all.  Also, we need to get implementation services involved ASAP.  Please have Riley from contracting confirm the pricing, as it wasn’t on the attached proposal.

Best, Teddy. 
BTW-say hi to your daughter Cooper.”

Emails like this one are sent by the millions every day, all day long.  If you were asked to classify it, would you say it is a record requiring long-term retention?  If so, what kind of record is it?

If you asked any employee to determine the business value of the email, they could classify it in many different, albeit CORRECT, ways.  Most employees predictably classify information from a parochial perspective, based on their own work experience.  If Lily, the recipient, classified it, her view would be colored by the utility of the email for her job or department.  In that case maybe it is a sales record that should be put in the Ace Leasing file.  On the other hand, as a manager she may see it as an HR-related record, since it recommends advancing Dylan and getting him a pay raise.  Maybe it should even go into Lily’s personnel folder, as it compliments her good management of her unit.  Maybe the email is a record for the contracting department, or instructions for the implementation of the project.  Maybe it is also a record for the Business Process Improvement team, which would need to fix a business process that management thinks is broken.  The fact is it could properly be classified as all of those types of records.  Different record types have different retention periods associated with them.  Further, depending on who classifies it and what business unit they are from, the result may be substantially different.
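To make the point concrete, here is a minimal sketch of how the same email can end up with different retention periods depending on who classifies it.  The category names, the retention periods in years, and the rule of keeping a record for the longest applicable period are all placeholders invented for illustration; they are not real retention rules.

```python
# Hypothetical record categories and retention periods (in years).
# These numbers are placeholders for illustration, not real retention rules.
RETENTION_YEARS = {
    "sales": 7,
    "hr": 10,
    "contracting": 6,
    "implementation": 3,
    "process_improvement": 2,
}

# Categories that different readers might each reasonably assign to Teddy's email.
VIEWPOINTS = {
    "Lily (sales support)": ["sales", "hr"],
    "Riley (contracting)": ["contracting"],
    "Implementation services": ["implementation"],
    "Business Process Improvement": ["process_improvement"],
}


def retention_for(categories):
    """Assumed rule: keep the record for the longest period among its categories."""
    return max(RETENTION_YEARS[c] for c in categories)


def retention_by_viewpoint(viewpoints):
    """Map each classifier to the retention period their classification implies."""
    return {who: retention_for(cats) for who, cats in viewpoints.items()}


results = retention_by_viewpoint(VIEWPOINTS)
for who, years in results.items():
    print(f"{who}: keep {years} years")
```

Every classification in this sketch is defensible, yet the outcomes differ, which is exactly the problem.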

Not surprisingly, employees are not particularly good at classifying information, even the smart ones, and if they don’t need to do it, they won’t, and they don’t even care.  Now imagine each employee touches 100 information nuggets daily that need classification.  This partly explains why classification is so difficult.  It also makes the point that there are many subjective right answers.  I believe many records could properly be classified in different correct ways, yet we sometimes think there is only one right way.

For almost a decade I have been thinking about the use of auto-classification technology to classify and manage information.  I used to think it wasn’t ready for prime time.  Today it is really powerful when used properly.  I then got hung up on lawyers attacking it, given its known failure rate.  I got over that, as they attack everything anyway, and reasonableness and information volumes dictate relying on technology to do the heavy classification lifting.  Given information volumes, expecting employees to do the classifying is like asking your auditors to count the grains of sand on the beach and classify them according to size and shape.  And now I am down to how effective the technology has to be before you allow a computer to do your classification.  There are no hard and fast rules about confidence ratings or efficacy scores (sometimes referred to as the F-Score), even though most people would be substantially comforted if there were simple rules for what was good or good enough.
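For readers who have not met the F-Score, it combines precision (of the items the tool put in a category, how many belong there) and recall (of the items that belong in a category, how many the tool found).  The sketch below computes it for one category from a reviewed sample.  The function name and the five sample labels are made up for illustration, and a given auto-classification product may report its efficacy differently.

```python
def f_score(predicted, actual, category, beta=1.0):
    """F-score for one category; beta=1.0 gives the usual F1 (harmonic mean)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == category and a == category)  # correct hits
    fp = sum(1 for p, a in pairs if p == category and a != category)  # false alarms
    fn = sum(1 for p, a in pairs if p != category and a == category)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical sample: what a reviewer decided vs. what the software decided.
actual = ["hr", "sales", "sales", "hr", "contracting"]
predicted = ["hr", "sales", "hr", "hr", "contracting"]

hr_f1 = f_score(predicted, actual, "hr")
print(f"F1 for 'hr': {hr_f1:.2f}")
```

Note that the "actual" labels come from a human reviewer, so the score inherits whatever subjectivity that reviewer brought to the task.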

I know employees are not good at classification.  I know that employees don’t have time to do it and even if they did, they usually won’t get it right.  I know people classify information in different ways and are rarely consistent from employee to employee.  I know information volumes for most big businesses are growing at 20-50% per year.  I know computers can do classification.  I know it is not simple or cheap to do auto-classification.  I know it takes upfront effort to get auto-classification right.  I know that a company can’t dispose of business information without some diligence process to ensure that records are retained and evidence is preserved.  I have concluded that every big business needs to consider defensible disposition of information, using technology to make it happen.  In the end, I know people will attack the process and they will attack the soft underbelly of auto-classification: the failure rate, the confidence score, the F-Score.  I used to think it had to be above 90% to be good enough.  Then I thought, well, maybe 80% is good enough.

Well, I have changed my thinking, because the paradigm bounding my thoughts on this topic was flawed.  As the classification tool crawls, it uses linguistic and numerical analysis to determine what something is and how to properly classify it.  In the end, if the software tells me it believes it is correct with a confidence score of 51% or higher, what that means is the software probably got it right, but maybe there is another category that is also a good option.  In the end, people do exactly what the technology does, but we hold technology to a different and higher standard.  I am not sure what the right confidence score is, but I think we need to give technology a chance and not look for reasons to dismiss its utility.  Nothing’s perfect, including your employees.
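One way to picture a 51% confidence score is as the top entry in a ranked list, with runner-up categories close behind.  The sketch below is only an illustration of that idea: the scores for Teddy’s email and the 0.25 margin are hypothetical, and real classification tools compute and report confidence in their own ways.

```python
def plausible_categories(scores, margin=0.25):
    """Return the top category, its score, and others scoring within `margin` of it."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_score = ranked[0]
    alternatives = [name for name, s in ranked[1:] if top_score - s <= margin]
    return top_name, top_score, alternatives


# Hypothetical confidence scores a tool might assign to Teddy's email.
email_scores = {
    "sales": 0.51,
    "hr": 0.30,
    "contracting": 0.12,
    "process_improvement": 0.07,
}

top, score, alternatives = plausible_categories(email_scores)
print(f"Best guess: {top} ({score:.0%}); also plausible: {alternatives}")
```

Read this way, a modest top score is not so much a failure as the software admitting what Lily would admit too: the email fits more than one category.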