
How Enterprises Can Safely Use Unstructured Data With LLMs

Forbes 3 days ago

is CEO of cybersecurity and data protection infrastructure firm SECURITI and ex-head of Symantec’s cloud security division.

Generative AI (GenAI) adoption is no longer a choice—it’s a necessity for every major organization looking to survive and thrive in a competitive global economy. Recent research from Gartner shows that GenAI is now the most commonly deployed type of AI within organizations.

To keep up with their peers in the generative AI era, enterprises will have to effectively and safely leverage their unstructured data.

Considerations For Safely Leveraging Unstructured Data For GenAI

Large language models (LLMs) mostly leverage unstructured data, such as text and media. Most existing enterprise data management technologies are designed to manage structured data, which is organized in traditional databases or formal schemas.

The use of unstructured data in GenAI introduces new types of governance, privacy and security risks that these traditional data management tools aren’t equipped to handle. To safely leverage GenAI, organizations require a radically different approach to governing unstructured data.

The approach should be built on these 10 core guidelines.

1. Gain deep contextual insights into unstructured data.

To ensure safe usage of GenAI, organizations must have a full view of their unstructured data’s context.

This includes where it is located, who has access to it, which policies apply and what the relevant regulations are. Without this context, enterprises are essentially flying blind when they leverage LLMs.

2. Discover, catalog and classify unstructured data.

Traditional data catalogs excel at handling structured data but encounter difficulties with unstructured data. Machine learning (ML) tools can help organizations automatically discover, catalog and classify files and objects, which are essential for GenAI projects.

3. Preserve entitlements of unstructured data.

Proper access management is a fundamental part of cybersecurity. Organizations need to preserve their existing enterprise entitlements at source systems to ensure that only authorized users access relevant data via GenAI prompts.
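One way to picture entitlement preservation is as a filter applied to retrieved content before it reaches the model. The sketch below is a minimal illustration, not a reference to any particular product: the `Chunk` class, the `allowed_groups` field and the `filter_by_entitlement` function are all hypothetical names, and it assumes each chunk was tagged at ingestion time with the groups allowed to read it in the source system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    # Hypothetical field: groups copied from the source system's access list at ingestion.
    allowed_groups: frozenset


def filter_by_entitlement(chunks, user_groups):
    """Keep only chunks that share at least one group with the requesting user."""
    user_groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


chunks = [
    Chunk("Q3 revenue forecast", "finance/forecast.docx", frozenset({"finance"})),
    Chunk("Office holiday schedule", "hr/holidays.pdf", frozenset({"finance", "engineering"})),
]

# An engineering user only sees the chunk their group could already open at the source.
visible = filter_by_entitlement(chunks, {"engineering"})
print([c.source for c in visible])
```

Filtering at retrieval time, rather than trusting the model to withhold content, keeps the access decision tied to the permissions that already exist in the source system.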

4. Trace the lineage of unstructured data.

It's important to map how data travels from its source to the final output. This process shows how the data moves from the unstructured data systems to vector databases to LLMs and, finally, to the endpoints.

Mapping provides an end-to-end view of data flow similar to the data lineage of structured data. It assists in monitoring the feeding of GenAI models and identifies contributing sources. It also verifies the integrity of responses from these models.
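The flow described above (source system, vector database, LLM, endpoint) can be sketched as a simple record attached to each response. This is an illustrative assumption about how such lineage might be stored; `LineageRecord`, the stage names and the identifiers below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    response_id: str
    hops: list = field(default_factory=list)  # ordered (stage, identifier) pairs

    def add_hop(self, stage, identifier):
        """Append one step of the data's journey and return self for chaining."""
        self.hops.append((stage, identifier))
        return self

    def sources(self):
        """List the original source documents that contributed to this response."""
        return [ident for stage, ident in self.hops if stage == "source"]


# Hypothetical identifiers for each stage of the flow.
record = (
    LineageRecord("resp-001")
    .add_hop("source", "sharepoint://contracts/msa.pdf")
    .add_hop("vector_db", "chunk-42")
    .add_hop("llm", "model-a")
    .add_hop("endpoint", "support-chatbot")
)
print(record.sources())
```

With a record like this, a questionable response can be traced back to the files that fed it.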

5. Automate the curation of unstructured data.

GenAI models use massive volumes of unstructured data. Automating the labeling or tagging of files can help to ensure that only relevant data is used in GenAI projects, safeguarding against unintentional use or exposure of sensitive information.

6. Extract unstructured data from diverse formats.

Data should be extracted from various unstructured formats—such as Word, PowerPoint, Excel, HTML, PDF and multimedia files (images, audio, video)—to improve its utilization. High-fidelity parsing captures a document's visual layout, aiding chunking for vectorization and enhancing LLM understanding.

This comprehensive extraction can provide LLMs with a complete and accurate understanding of the data for more effective processing.

7. Sanitize unstructured data.

Unstructured data must be sanitized before it is sent to GenAI models. This data can contain personal information (PI), personally identifiable information (PII) and other sensitive information. There is always a risk of exposing this data accidentally.

If GenAI models are trained on sensitive information, it can be very difficult to remove afterward, compromising data privacy. It's crucial to classify, redact and mask sensitive data in the files that GenAI projects use.
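As a minimal sketch of the redaction step, the example below masks two kinds of sensitive values with regular expressions before text is passed on. The patterns and the `redact` function are illustrative assumptions; real deployments typically rely on far broader classifiers than two regexes, and these patterns will miss many formats.

```python
import re

# Illustrative patterns only: email addresses and US Social Security numbers.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each match with a bracketed label naming the kind of data removed."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
# prints: Contact [EMAIL], SSN [SSN].
```

Replacing values with a label rather than deleting them keeps the sentence readable for the model while removing the sensitive detail.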

8. Focus on the quality of unstructured data.

Data quality in the traditional sense of accuracy or completeness does not translate directly to unstructured data.

Focusing on the freshness of data, deduplication to remove repeated occurrences, relevance to the topic and reliability of sources helps prevent unintended data from being used for GenAI projects.
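Two of these checks, freshness and deduplication, lend themselves to a short sketch. The example assumes each document is a dictionary with hypothetical `id`, `text` and `modified` keys; the one-year cutoff is an arbitrary illustrative threshold, and exact-hash matching only catches duplicates that are identical after normalizing case and whitespace.

```python
import hashlib
from datetime import datetime, timedelta


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())


def curate(docs, now, max_age_days=365):
    """Drop documents older than the cutoff, then drop repeated content."""
    cutoff = now - timedelta(days=max_age_days)
    seen = set()
    kept = []
    for doc in docs:
        if doc["modified"] < cutoff:
            continue  # stale: fails the freshness check
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept


now = datetime(2024, 6, 1)
docs = [
    {"id": "a", "text": "Refund policy: 30 days.", "modified": datetime(2024, 5, 1)},
    {"id": "b", "text": "refund  policy: 30 days.", "modified": datetime(2024, 5, 2)},
    {"id": "c", "text": "Refund policy: 14 days.", "modified": datetime(2019, 1, 1)},
]
print([d["id"] for d in curate(docs, now)])
# prints: ['a']
```

Relevance and source reliability are harder to reduce to a rule like this and usually need classification or human review.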

9. Secure unstructured prompts, retrievals and responses.

GenAI assistants, bots and agents are vulnerable to malicious use and attacks. Businesses should look to the appropriate tools to automatically detect, classify and redact sensitive information on the fly, block toxic content and enforce compliance with topic and tone guidelines.

These may include technologies like context-aware LLM firewalls to protect prompts, retrievals and responses.

10. Ensure compliance while handling unstructured data.

GenAI uses large volumes of unstructured data, which can contain sensitive information and be a privacy minefield. Enterprises must be vigilant that their GenAI models are leveraging data in compliance with all global data and AI standards, such as GDPR, HIPAA, PCI DSS, the EU AI Act and NIST AI RMF.

GenAI should empower—not frighten—organizations.

GenAI has taken off at an incredible pace. Implementing any radical new technology can be intimidating, and LLMs are no exception. With its reliance on massive volumes of unstructured data, GenAI certainly comes with its share of risk.

If organizations embrace the data management tips for unstructured data outlined here, they will successfully operationalize GenAI to drive efficiency, growth and innovation.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
