Why Using a Generic LLM for Enrichment Simply Doesn’t Work

Thinking about using a generic LLM, such as ChatGPT, to enrich customer transaction data? Think again.

A specialized enrichment LLM combined with a market database outperforms a generic LLM.  

Over the past 18 months, many financial institutions have used ChatGPT (or a competitor) to get a better understanding of the transactions they process. Often, the results were encouraging, with the LLM returning a cleaner, easier-to-understand transaction description or even identifying a merchant. However, there are some fundamental issues with this approach, and the most important of them is the quality of the output.

Generic LLMs are built for natural language and work best when dealing with it. Transaction data might resemble natural language but, in reality, it’s generated by multiple payment processors in an ever-changing merchant landscape. This poses a problem for generic LLMs: to handle such data, they would need to be trained on the specific data points that actually appear in transactions.
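
To see why, consider what raw card descriptors look like in practice. The sketch below is purely illustrative (the descriptors, prefixes, and cleaning rules are invented for this example, not taken from any real processor or from Bud’s pipeline); it shows how much processor-specific structure hides in a single “sentence”:

```python
import re

# Purely illustrative descriptors of the kind card processors emit;
# none of these refer to real transactions or merchants.
raw_descriptors = [
    "PAYPAL *JOHNSSTORE 402-935-7733 GBR",
    "AMZNMKTPLACE AMAZON.CO.UK LUX",
    "CRV*COFFEE HSE 0042 LONDON GB",
]

def naive_clean(descriptor: str) -> str:
    """A naive rule-based cleaner: strips processor prefixes, phone
    numbers, store/terminal numbers and trailing country codes.
    Real enrichment needs far more than this, which is the point."""
    d = re.sub(r"^(PAYPAL \*|CRV\*|SQ \*)", "", descriptor)  # processor prefixes
    d = re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "", d)              # phone numbers
    d = re.sub(r"\b\d{3,}\b", "", d)                         # store/terminal numbers
    d = re.sub(r"\b(GBR|LUX|GB)$", "", d.strip())            # trailing country codes
    return re.sub(r"\s{2,}", " ", d).strip()

for raw in raw_descriptors:
    print(f"{raw!r} -> {naive_clean(raw)!r}")
```

Rules like these break the moment a processor changes its format, which is exactly the gap a model trained on transaction data is meant to close.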

But this is not the most problematic aspect of generic LLMs’ output quality. LLMs are prone to hallucinations, and this is especially dangerous when transaction enrichment feeds use cases such as risk monitoring, affordability checks, or dispute management. Providing wrong information (like the wrong transaction merchant) is a far worse outcome than returning no answer and leaving the transaction for further investigation.

Depending on the specific usage patterns, hallucinations can occur even on simple tasks like categorization. When tested, LLMs such as GPT-4 might return a made-up category, depending on how much data has been provided and on the context. The output in those cases is not necessarily wrong, but it is inconsistent, and that lack of consistency is an issue in its own right, beyond hallucinations.
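
A common guardrail, shown here as a minimal sketch (the taxonomy and function below are hypothetical, not a real provider API), is to validate any model-suggested category against a closed taxonomy and treat everything else as “no answer” rather than silently admitting a new category:

```python
from typing import Optional

# Hypothetical closed taxonomy; a real deployment would use the
# institution's own category scheme.
TAXONOMY = {"groceries", "transport", "utilities", "eating_out", "subscriptions"}

def constrained_category(model_output: str) -> Optional[str]:
    """Accept the model's suggestion only if it already exists in the
    taxonomy; otherwise return None so the transaction can be routed
    to further investigation instead of being mislabelled."""
    candidate = model_output.strip().lower()
    return candidate if candidate in TAXONOMY else None

# A made-up category like "artisan beverages" is rejected, not stored.
assert constrained_category("Groceries") == "groceries"
assert constrained_category("artisan beverages") is None
```

Rejecting out-of-taxonomy labels keeps the output consistent, but it does not fix the underlying model; unmatched transactions still need a separate resolution path.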

The hallucination issue is most acute in the ‘long tail’ of transactions with less popular merchants, which customers also find harder to recognize, making them more likely to become the subject of an inquiry. What complicates the situation further is that generic LLMs are often modified or updated on an ongoing basis, which makes their output occasionally unpredictable.

The unpredictable results often manifest as random changes to the taxonomy, with the model failing to follow its instructions. Further challenges range from practical ones, such as a limited context window and a lack of control over the training data (which, among other things, means that changes in the merchant market are rarely reflected in a timely fashion), to explainability. Tools like ChatGPT might also introduce bias stemming from the data they were trained on.

Aside from the quality of the outcomes, it’s important to ensure data is only retained in the places you expect, i.e. your own systems or agreed third-party systems. Open, generic LLMs can, in some instances, make that impossible. Transactions often contain sensitive data and, in some cases, personally identifiable information. It’s common for generic LLM providers to retain information submitted by users and sometimes even to use it for training.
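
At a minimum, anything that must leave your systems should be redacted first. The sketch below is a deliberately simple illustration (the patterns and placeholder tokens are invented for this example; production-grade PII handling requires much stricter controls):

```python
import re

def redact(descriptor: str) -> str:
    """A minimal, illustrative redaction pass: masks 13-19 digit card
    numbers (PANs) and long account-like digit runs before the string
    leaves your systems."""
    d = re.sub(r"\b\d{13,19}\b", "[PAN]", descriptor)
    d = re.sub(r"\b\d{8,12}\b", "[ACCT]", d)
    return d

print(redact("TRANSFER FROM 12345678 CARD 4111111111111111"))
# -> "TRANSFER FROM [ACCT] CARD [PAN]"
```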

In our testing, Bud’s dedicated, specialized enrichment LLM combined with a market database comfortably outperforms a generic LLM. Specifically, the rate of false positives is considerably lower, there are no performance penalties from having to maintain or refresh context, and the output is consistent with the expected taxonomy. Also, there is a guarantee that if a merchant is not identified, there will be no hallucinated output (e.g. a made-up business).
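
In architectural terms, that guarantee amounts to a lookup-first design. The sketch below is a hypothetical illustration of the principle, not Bud’s actual implementation: the merchant is either resolved against a maintained database or explicitly reported as unmatched, and no generative step is allowed to invent an entity:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Merchant:
    name: str
    category: str

# Hypothetical in-memory stand-in for a maintained market database.
MARKET_DB = {
    "TESCO STORES 2041": Merchant("Tesco", "groceries"),
    "TFL TRAVEL CH": Merchant("Transport for London", "transport"),
}

def enrich(descriptor: str) -> Optional[Merchant]:
    """Resolve a merchant from the database or return None.
    There is no generative fallback, so an unknown descriptor can
    never produce a fabricated business."""
    return MARKET_DB.get(descriptor)

print(enrich("TESCO STORES 2041"))  # Merchant(name='Tesco', category='groceries')
print(enrich("UNKNOWN MERCHANT"))   # None -> flag for investigation
```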

As of today, the state-of-the-art approach to financial transaction enrichment does not include a generic LLM, and it’s not obvious whether generic LLMs will be able to catch up in the foreseeable future. We expect the future to see growth in specialized models, and this is exactly what Bud provides.
