Cracking the Code for Unstructured Data Classification

This blog was adapted from the original article on Built In.

Every enterprise, no matter the sector or size, is dealing with the same issue: unstructured data chaos. There’s too much of it, it’s growing too quickly and it’s becoming unaffordable to store. It’s also one of the most valuable assets that companies possess. Yet its sheer size (multiple petabytes in midsize to large organizations) and its distribution across on-premises, edge and cloud storage make it tough to leverage.

According to 2023 research by IDC, organizations analyze less than half of their unstructured data to extract value, and they also reuse less than half of said data.

Why Unstructured Data Classification Matters

Unstructured data classification is important because it adds structure to unstructured data – which makes it more findable and usable across the organization. Classification starts with the metadata that’s automatically generated by data storage technology.

System-generated metadata includes information about when the data was created, who created it, its type, its size, when it was last accessed and when it was last modified. This helps IT managers classify data by the department it belongs to and identify rarely accessed data as ready for archiving and tiering to lower-cost storage destinations. IT professionals can also search based on data types, such as video or medical imaging files, which may be consuming too much storage (and budget) and require action such as migration.
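As a rough illustration of classifying on system-generated metadata alone, here is a minimal Python sketch. It assumes a locally mounted file share, a one-year "cold" threshold and hypothetical names (FileRecord, is_cold, cold_bytes_by_extension); none of this comes from a particular product. Access times can be unreliable on volumes mounted with options that suppress access-time updates, so treat the result as a starting point.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterable, Iterator

COLD_AFTER_DAYS = 365  # assumed threshold; pick one that matches your policy
SECONDS_PER_DAY = 86400


@dataclass
class FileRecord:
    """Hypothetical record holding metadata the file system already tracks."""
    path: str
    extension: str
    size_bytes: int
    owner_uid: int
    modified: float  # seconds since the epoch
    accessed: float  # seconds since the epoch


def read_metadata(path: Path) -> FileRecord:
    """Read system-generated metadata for one file without opening it."""
    st = path.stat()
    return FileRecord(
        path=str(path),
        extension=path.suffix.lower(),
        size_bytes=st.st_size,
        owner_uid=st.st_uid,
        modified=st.st_mtime,
        accessed=st.st_atime,
    )


def scan(root: str) -> Iterator[FileRecord]:
    """Walk a directory tree and yield a record for every regular file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            candidate = Path(dirpath) / name
            if candidate.is_file():
                yield read_metadata(candidate)


def is_cold(record: FileRecord, now: float,
            cold_after_days: int = COLD_AFTER_DAYS) -> bool:
    """A file is cold if it was neither read nor modified within the window."""
    last_touched = max(record.accessed, record.modified)
    return (now - last_touched) > cold_after_days * SECONDS_PER_DAY


def cold_bytes_by_extension(records: Iterable[FileRecord],
                            now: float) -> Dict[str, int]:
    """Total the size of cold files per file type to spot archiving candidates."""
    totals: Dict[str, int] = {}
    for record in records:
        if is_cold(record, now):
            totals[record.extension] = totals.get(record.extension, 0) + record.size_bytes
    return totals
```

Calling cold_bytes_by_extension(scan("/mnt/share"), time.time()) (with the standard time module imported and the mount path swapped for your own) would return a per-type total of cold bytes, which is the kind of summary that points to data ready for tiering.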

What Capabilities Do Your Tools Need?

For additional unstructured data classification, it’s important to enrich metadata using tools that can crack open file contents to search for keywords or data types. This includes searching for sensitive personally identifiable information (PII), particular items in an image or videos with specific content. These tools may incorporate AI or machine learning technology to rapidly scan across file shares and directories to identify matches, but they usually can’t store this information.

Unstructured data management solutions, however, can fill this critical gap by feeding the right data to the AI/ML indexers and tagging the outcomes of those AI scans. This delivers more metadata that can be readily searched and brings value in many ways. Given the sheer size of unstructured data and its siloed nature, automation is imperative to enrich the metadata needed for classification.
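To make the enrichment idea concrete, here is a small Python sketch of scanning file contents and storing the outcome as searchable tags. Simple regular expressions stand in for the AI/ML indexer, and the tag names, the in-memory index and the functions (tags_for_text, enrich, find_by_tag) are all assumptions for illustration. A real scanner would need far more robust detection than these two patterns.

```python
import re
from typing import Dict, Iterable, List, Set

# Illustrative patterns only; real PII detection needs much broader coverage.
PATTERNS = {
    "pii:email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "pii:us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tags_for_text(text: str, keywords: Iterable[str] = ()) -> Set[str]:
    """Return the tags that apply to one file's extracted text."""
    tags = {tag for tag, pattern in PATTERNS.items() if pattern.search(text)}
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            tags.add(f"keyword:{keyword.lower()}")
    return tags


def enrich(index: Dict[str, Set[str]], path: str, text: str,
           keywords: Iterable[str] = ()) -> None:
    """Persist scan results so they can be searched later without rescanning."""
    index.setdefault(path, set()).update(tags_for_text(text, keywords))


def find_by_tag(index: Dict[str, Set[str]], tag: str) -> List[str]:
    """Look up every path that carries a given tag."""
    return sorted(path for path, tags in index.items() if tag in tags)
```

The point of the index is that the scan result outlives the scan: once a file is tagged, finding every file with a given tag is a lookup rather than another pass over the contents.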

Use Cases for Data Classification

Security and Privacy
Data classification is critical to discover personally identifiable information, IP and other sensitive data that may be hidden or has been copied and stored in noncompliant locations. An organization can apply levels of security classification too, such as low, medium or high risk.

Audits and E-discovery
Some organizations undergo regular audits, such as for proper management of financial data or personal health information, which require IT to work with auditors and demonstrate compliance. Without classification and segmentation of audited data, an organization may face heavy manual work to locate it. For e-discovery, which can arise without warning, a company may need to quickly locate and copy security video footage to facilitate an investigation, for instance.

Data Retention
Industry or corporate rules may dictate the retention of files for a set period. By searching metadata for file type, such as medical images, and time of creation, IT can find files that have passed their retention period and are prime for deletion. This also saves money by avoiding the endless storage of data that is no longer needed or required. Komprise Smart Data Workflows can allow IT to create workflows that discover and confine or delete files by policy.
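The retention check itself is simple to sketch. The Python example below is a generic illustration, not a depiction of any Komprise workflow: the policy table, the retention periods and the function names are hypothetical, and a year is approximated as 365 days.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple

# Hypothetical policy: retention period in years per file extension.
RETENTION_YEARS: Dict[str, int] = {".dcm": 7, ".log": 1}


def past_retention(extension: str, created: datetime, now: datetime,
                   policy: Dict[str, int] = RETENTION_YEARS) -> bool:
    """True when a file's age exceeds the retention period for its type."""
    years = policy.get(extension.lower())
    if years is None:
        return False  # no rule for this type, so never flag it automatically
    return now - created > timedelta(days=365 * years)


def deletion_candidates(files: Iterable[Tuple[str, str, datetime]],
                        now: datetime,
                        policy: Dict[str, int] = RETENTION_YEARS) -> List[str]:
    """files holds (path, extension, created) tuples taken from metadata."""
    return [path for path, extension, created in files
            if past_retention(extension, created, now, policy)]
```

File types without a rule are deliberately left alone here, so a gap in the policy table leads to data being kept rather than deleted.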

Cost Savings
Data classification by age and time of last access is a smart way to find data that is rarely accessed, or “cold,” and move it to archival storage where it can be retained for as long as necessary — at a fraction of the cost. Metadata indicating file type, such as instrument or research data, further informs long-term storage strategies. Learn more about Komprise Analysis here.

Search and AI
Deep classification of unstructured data sets, such as by keyword or project name, helps employees find what they need without bugging IT. They can then feed it to analytics tools or other applications as needed. For instance, healthcare analysts may want to run a study of breast cancer images from a certain demographic and with a particular diagnosis code. Enriching metadata with these tags in a policy-driven, automated way means that the required data sets are always up to date and easy for researchers to locate.

Data Governance for AI
IT and security teams can also tag and segment proprietary data sets that are banned from ingestion by AI tools. This is an important consideration when using publicly available GenAI tools, since sensitive and protected data can be easily and unwittingly leaked into training models.
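One way to picture this guardrail is as a filter applied before any data set reaches an AI pipeline. The Python sketch below assumes a tag index that maps each path to a set of tags; the tag names and the allowed_for_ai function are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical tags that mark data as off-limits for AI ingestion.
BLOCKING_TAGS: FrozenSet[str] = frozenset({"proprietary", "restricted", "pii:email"})


def allowed_for_ai(index: Dict[str, Set[str]],
                   blocking_tags: FrozenSet[str] = BLOCKING_TAGS) -> List[str]:
    """Return only the paths that carry none of the blocking tags."""
    return sorted(path for path, tags in index.items()
                  if tags.isdisjoint(blocking_tags))
```

Note that an untagged file passes this filter, so the approach is only as good as the classification coverage behind it. A stricter variant could require an explicit approval tag instead.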

Unstructured data classification is no longer a nice-to-have capability: it is a requirement to manage the risks of uncontrolled, distributed data. It allows storage managers to deliver more services to the broader organization — whether that is to supplement data security and privacy needs, lower storage costs or deliver a Google-like search experience to find, tag and move precise data sets to data lakes and AI tools for analysis.
