Build financial search applications using the Amazon Bedrock Cohere multilingual embedding model

Posted on 12 January 2024 by urdupoint.live

Enterprises have access to massive amounts of data, much of which is difficult to discover because the data is unstructured. Conventional approaches to analyzing unstructured data use keyword or synonym matching. They don’t capture the full context of a document, making them less effective in dealing with unstructured data.

In contrast, text embeddings use machine learning (ML) capabilities to capture the meaning of unstructured data. Embeddings are generated by representational language models that translate text into numerical vectors and encode contextual information in a document. This enables applications such as semantic search, Retrieval Augmented Generation (RAG), topic modeling, and text classification.

For example, in the financial services industry, applications include extracting insights from earnings reports, searching for information from financial statements, and analyzing sentiment about stocks and markets found in financial news. Text embeddings enable industry professionals to extract insights from documents, minimize errors, and increase their performance.

In this post, we showcase an application that can search and query across financial news in different languages using Cohere’s Embed and Rerank models with Amazon Bedrock.

Cohere’s multilingual embedding model

Cohere is a leading enterprise AI platform that builds world-class large language models (LLMs) and LLM-powered solutions that allow computers to search, capture meaning, and converse in text. They provide ease of use and strong security and privacy controls.

Cohere’s multilingual embedding model generates vector representations of documents for over 100 languages and is available on Amazon Bedrock. This allows AWS customers to access it as an API, which eliminates the need to manage the underlying infrastructure and ensures that sensitive information remains securely managed and protected.

The multilingual model groups text with similar meanings by assigning them positions that are close to each other in a semantic vector space. With a multilingual embedding model, developers can process text in multiple languages without the need to switch between different models, as illustrated in the following figure. This makes processing more efficient and improves performance for multilingual applications.
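To make "close to each other in a semantic vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are made up for illustration and are not real model output. Actual embeddings from the model have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (invented values, for illustration only)
toy_embeddings = {
    "Stock markets fall sharply": [0.9, 0.1, 0.2],
    "Las bolsas caen con fuerza": [0.85, 0.15, 0.25],
    "New recipe for apple pie": [0.1, 0.9, 0.3],
}

query_vec = toy_embeddings["Stock markets fall sharply"]
for text, vec in toy_embeddings.items():
    print(f"{cosine_similarity(query_vec, vec):.3f}  {text}")
```

In this toy setup, the English and Spanish headlines about falling markets score close to 1, while the unrelated headline scores much lower. This is the behavior a multilingual embedding model aims for across languages.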

The following are some of the highlights of Cohere’s embedding model:

Focus on document quality – Typical embedding models are trained to measure similarity between documents, but Cohere’s model also measures document quality
Better retrieval for RAG applications – RAG applications require a good retrieval system, which Cohere’s embedding model excels at
Cost-efficient data compression – Cohere uses a special, compression-aware training method, resulting in substantial cost savings for your vector database
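To give a rough sense of why compressed embeddings reduce vector database costs, the following sketch applies plain int8 scalar quantization to a toy vector. This is an assumption made for illustration only: the post does not describe which compression scheme Cohere's training targets, and the helper names here are hypothetical.

```python
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using one scale factor per vector."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # fall back to 1.0 for an all-zero vector
    return [round(x / scale) for x in vec], scale

def dequantize(qvec, scale):
    """Approximate the original floats from the quantized integers."""
    return [q * scale for q in qvec]

vec = [0.12, -0.5, 0.33, 0.07]  # toy values, not real model output
qvec, scale = quantize_int8(vec)
restored = dequantize(qvec, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))

dims = 1024  # dimensionality of the embeddings used later in this post
print(f"float32: {dims * 4} bytes per vector, int8: {dims * 1} bytes per vector")
print(f"max reconstruction error on the toy vector: {max_error:.4f}")
```

Storing one byte per dimension instead of four cuts storage to a quarter, at the price of a small reconstruction error. A model trained with compression in mind is meant to keep retrieval quality high despite that error.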

Use cases for text embedding

Text embeddings turn unstructured data into a structured form. This allows you to objectively compare, dissect, and derive insights from all of these documents. The following are example use cases that Cohere’s embedding model enables:

Semantic search – Enables powerful search applications when coupled with a vector database, with excellent relevance based on search phrase meaning
Search engine for a larger system – Finds and retrieves the most relevant information from connected enterprise data sources for RAG systems
Text classification – Supports intent recognition, sentiment analysis, and advanced document analysis
Topic modeling – Turns a collection of documents into distinct clusters to uncover emerging topics and themes
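As one illustration of the text classification use case, the following sketch assigns a label by comparing a vector against the average (centroid) of a few labeled example vectors. The two-dimensional vectors and labels are invented. In practice the vectors would come from the embedding model, and nearest-centroid is only one of several possible classifiers.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical labeled example embeddings (toy values)
labeled_examples = {
    "positive": [[0.9, 0.1], [0.8, 0.2]],
    "negative": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {label: centroid(vecs) for label, vecs in labeled_examples.items()}

def classify(vec):
    """Return the label whose centroid has the largest inner product with vec."""
    return max(centroids, key=lambda label: dot(vec, centroids[label]))

print(classify([0.7, 0.3]))
```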

Enhanced search systems with Rerank

In enterprises where conventional keyword search systems are already present, how do you introduce modern semantic search capabilities? For such systems that have been part of a company’s information architecture for a long time, a complete migration to an embeddings-based approach is, in many cases, just not feasible.

Cohere’s Rerank endpoint is designed to bridge this gap. It acts as the second stage of a search flow to provide a ranking of relevant documents per a user’s query. Enterprises can retain an existing keyword (or even semantic) system for the first-stage retrieval and boost the quality of search results with the Rerank endpoint in the second-stage reranking.

Rerank provides a fast and straightforward option for improving search results by introducing semantic search technology into a user’s stack with a single line of code. The endpoint also comes with multilingual support. The following figure illustrates the retrieval and reranking workflow.
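The two-stage flow can be sketched in a few lines. In this sketch, the first stage is a naive keyword-overlap search and the second stage reorders its candidates with a scoring function. The `toy_relevance` function is a hypothetical stand-in for a learned relevance model such as Rerank, not its actual behavior, and all documents are made up.

```python
STOPWORDS = {"a", "are", "at", "for", "the"}

def tokens(text):
    return set(text.lower().split())

def keyword_search(query, documents, k):
    """First stage: rank documents by the number of tokens shared with the query."""
    q = tokens(query)
    scored = sorted(
        ((len(q & tokens(doc)), i) for i, doc in enumerate(documents)),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [documents[i] for score, i in scored[:k] if score > 0]

def toy_relevance(query, doc):
    """Hypothetical stand-in for a learned reranker: shared non-stopword tokens."""
    return len((tokens(query) - STOPWORDS) & (tokens(doc) - STOPWORDS))

def rerank(query, candidates, score_fn, top_n):
    """Second stage: reorder first-stage candidates by relevance score."""
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:top_n]

documents = [
    "Companies for sale a full list",
    "Crisis preparedness lacking at many companies",
    "Weather forecast for the weekend",
]
query = "Are companies ready for a crisis"

candidates = keyword_search(query, documents, k=2)
best = rerank(query, candidates, toy_relevance, top_n=1)
print("First stage:", candidates)
print("After reranking:", best)
```

The first stage ranks the "for sale" headline highest because it shares the most words with the query, including filler words. The second stage promotes the crisis-preparedness headline. The existing search system stays untouched, and only the ordering of its candidates changes.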

Solution overview

Financial analysts need to digest a lot of content, such as financial publications and news media, in order to stay informed. According to the Association for Financial Professionals (AFP), financial analysts spend 75% of their time gathering data or administering the process instead of added-value analysis. Finding the answer to a question across a variety of sources and documents is time-intensive and tedious work. The Cohere embedding model helps analysts quickly search across numerous article titles in multiple languages to find and rank the articles that are most relevant to a particular query, saving an enormous amount of time and effort.

In the following use case example, we showcase how Cohere’s Embed model searches and queries across financial news in different languages in a single pipeline. Then we demonstrate how adding Rerank to your embeddings retrieval (or adding it to a legacy lexical search) can further improve results.

The supporting notebook is available on GitHub.

The following diagram illustrates the workflow of the application.

Enable model access through Amazon Bedrock

Amazon Bedrock users need to request access to models to make them available for use. To request access to additional models, choose Model access in the navigation pane on the Amazon Bedrock console. For more information, see Model access. For this walkthrough, you need to request access to the Cohere Embed Multilingual model.

Install packages and import modules

First, we install the necessary packages and import the modules we’ll use in this example:

!pip install --upgrade cohere-aws hnswlib translate

import pandas as pd
import cohere_aws
import hnswlib
import os
import re
import boto3

Import documents

We use a dataset (MultiFIN) containing a list of real-world article headlines covering 15 languages (English, Turkish, Danish, Spanish, Polish, Greek, Finnish, Hebrew, Japanese, Hungarian, Norwegian, Russian, Italian, Icelandic, and Swedish). This is an open source dataset curated for financial natural language processing (NLP) and is available on a GitHub repository.

In our case, we’ve created a CSV file with MultiFIN’s data as well as a column with translations. We don’t use this column to feed the model; we use it to help us follow along when we print the results for those who don’t speak Danish or Spanish. We point to that CSV to create our dataframe:

url = "
df = pd.read_csv(url)

# Inspect dataset
df.head(5)

Select a list of documents to query

MultiFIN has over 6,000 records in 15 different languages. For our example use case, we focus on three languages: English, Spanish, and Danish. We also sort the headers by length and pick the longest ones.

Because we’re picking the longest articles, we need to make sure their length isn’t caused by repeated sequences. The following example shows a header where that is the case, which we then clean up.

df['text'].iloc[2215]

'El 86% de las empresas españolas comprometidas con los Objetivos de Desarrollo 
Sostenible comprometidas con los Objetivos de Desarrollo Sostenible comprometidas 
con los Objetivos de Desarrollo Sostenible comprometidas con los Objetivos de 
Desarrollo Sostenible'

# Ensure there is no duplicated text in the headers
def remove_duplicates(text):
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

df['text'] = df['text'].apply(remove_duplicates)

# Keep only selected languages
languages = ['English', 'Spanish', 'Danish']
df = df.loc[df['lang'].isin(languages)].copy()  # copy so the column added below doesn't trigger SettingWithCopyWarning

# Pick the top 80 longest articles
df['text_length'] = df['text'].str.len()
df.sort_values(by=['text_length'], ascending=False, inplace=True)
top_80_df = df[:80]

# Language distribution
top_80_df['lang'].value_counts()

Our list of documents is nicely distributed across the three languages:

lang
Spanish    33
English    29
Danish     18
Name: count, dtype: int64
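To see what the remove_duplicates cleanup step does, the following self-contained sketch applies the same regular expression to a string rebuilt from the repeated Spanish header shown earlier. The string is reconstructed here by concatenation, which assumes the original row contains no line breaks.

```python
import re

def remove_duplicates(text):
    # Same pattern as in the walkthrough: find a short phrase that occurs again
    # later in the string, and collapse everything from its first occurrence
    # through its last occurrence into a single copy of that phrase.
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

prefix = "El 86% de las empresas españolas"
phrase = " comprometidas con los Objetivos de Desarrollo Sostenible"
noisy = prefix + phrase * 4  # the phrase appears four times, as in the sample

cleaned = remove_duplicates(noisy)
print(cleaned)
print(len(noisy), "->", len(cleaned), "characters")
```

Because the pattern collapses everything between the first and last occurrence of a repeated phrase, it can also shorten a header that legitimately repeats a pair of words. It is worth spot-checking a few cleaned rows before relying on it.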

The following is the longest article header in our dataset:

top_80_df['text'].iloc[0]

"CFOdirect: Resultater fra PwC's Employee Engagement Landscape Survey, herunder hvordan 
man skaber mere engagement blandt medarbejdere. Læs desuden om de regnskabsmæssige 
konsekvenser for indkomstskat ifbm. Brexit"

Embed and index documents

Now, we want to embed our documents and store the embeddings. The embeddings are high-dimensional vectors that encapsulate the semantic meaning of each document. In particular, we use Cohere’s embed-multilingual-v3.0 model, which creates embeddings with 1,024 dimensions.

When a query is passed, we also embed the query and use the hnswlib library to find the closest neighbors.

It only takes a few lines of code to establish a Cohere client, embed the documents, and create the search index. We also keep track of the language and translation of the document to enrich the display of the results.

# Establish Cohere client
co = cohere_aws.Client(mode=cohere_aws.Mode.BEDROCK)
model_id = "cohere.embed-multilingual-v3"

# Embed documents
docs = top_80_df['text'].to_list()
docs_lang = top_80_df['lang'].to_list()
translated_docs = top_80_df['translated_text'].to_list() #for reference when returning non-English results
doc_embs = co.embed(texts=docs, model_id=model_id, input_type="search_document").embeddings

# Create a search index
index = hnswlib.Index(space="ip", dim=1024)
index.init_index(max_elements=len(doc_embs), ef_construction=512, M=64)
index.add_items(doc_embs, list(range(len(doc_embs))))
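For intuition about what the index computes, the following sketch performs the same kind of lookup by brute force on toy vectors. It assumes hnswlib's documented convention for the "ip" space, where the distance is 1 minus the inner product. hnswlib returns approximate nearest neighbors, whereas this version is exact and only practical for small collections.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def brute_force_knn(query_vec, doc_vecs, k):
    """Exact top-k search; returns (ids, distances) with distance = 1 - inner product."""
    scored = sorted(
        (1.0 - inner_product(query_vec, vec), i) for i, vec in enumerate(doc_vecs)
    )
    top = scored[:k]
    return [i for _, i in top], [dist for dist, _ in top]

# Toy 2-dimensional vectors standing in for the 1,024-dimensional embeddings
toy_doc_embs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query_emb = [0.8, 0.6]

ids, dists = brute_force_knn(toy_query_emb, toy_doc_embs, k=2)
print(ids, dists)
```

With 80 documents a brute-force scan like this would be fast enough. An approximate index pays off as the collection grows.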

Build a retrieval system

Next, we build a function that takes a query as input, embeds it, and finds the three headers most closely related to it:

# Retrieval of the 3 closest docs to the query
def retrieval(query):
    # Embed query and retrieve results
    query_emb = co.embed(texts=[query], model_id=model_id, input_type="search_query").embeddings
    doc_ids = index.knn_query(query_emb, k=3)[0][0] # retrieve the 3 closest neighbors
    
    # Print and append results
    print(f"QUERY: {query.upper()} \n")
    retrieved_docs, translated_retrieved_docs = [], []
    
    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(docs[doc_id])
        translated_retrieved_docs.append(translated_docs[doc_id])
    
        # Print results
        print(f"ORIGINAL ({docs_lang[doc_id]}): {docs[doc_id]}")
        if docs_lang[doc_id] != "English":
            print(f"TRANSLATION: {translated_docs[doc_id]} \n----")
        else:
            print("----")
    print("END OF RESULTS \n\n")
    return retrieved_docs, translated_retrieved_docs

Query the retrieval system

Let’s explore what our system does with a couple of different queries. We start with English:

queries = [
    "Are businesses meeting sustainability goals?",
    "Can data science help meet sustainability goals?"
]

for query in queries:
    retrieval(query)

The results are as follows:

QUERY: ARE BUSINESSES MEETING SUSTAINABILITY GOALS? 

ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals 
improves, but has a long way to go to meet and drive targets.
----
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but 
businesses remain on starting blocks for integration and progress
----
ORIGINAL (Spanish): Integrar los criterios ESG y el propósito en la estrategia 
principal reto de los Consejos de las empresas españolas en el mundo post-COVID 

TRANSLATION: Integrate ESG criteria and purpose into the main challenge strategy 
of the Boards of Spanish companies in the post-COVID world 
----
END OF RESULTS 

QUERY: CAN DATA SCIENCE HELP MEET SUSTAINABILITY GOALS? 

ORIGINAL (English): Using AI to better manage the environment could reduce greenhouse 
gas emissions, boost global GDP by up to 38m jobs by 2030
----
ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals 
improves, but has a long way to go to meet and drive targets.
----
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but 
businesses remain on starting blocks for integration and progress
----
END OF RESULTS

Notice the following:

We’re asking related, but slightly different questions, and the model is nuanced enough to present the most relevant results at the top.
Our model does not perform keyword-based search, but semantic search. Even if we’re using a term like “data science” instead of “AI,” our model is able to understand what’s being asked and return the most relevant result at the top.

How about a query in Danish? Let’s look at the following query:

query = "Hvor kan jeg finde den seneste danske boligplan?" # "Where can I find the latest Danish property plan?"
retrieved_docs, translated_retrieved_docs = retrieval(query)

QUERY: HVOR KAN JEG FINDE DEN SENESTE DANSKE BOLIGPLAN? 

ORIGINAL (Danish): Nyt fra CFOdirect: Ny PP&E-guide, FAQs om den nye leasingstandard, 
podcast om udfordringerne ved implementering af leasingstandarden og meget mere

TRANSLATION: New from CFOdirect: New PP&E guide, FAQs on the new leasing standard, 
podcast on the challenges of implementing the leasing standard and much more 
----
ORIGINAL (Danish): Lovforslag fremlagt om rentefri lån, udskudt frist for 
lønsumsafgift, førtidig udbetaling af skattekredit og loft på indestående på 
skattekontoen

TRANSLATION: Legislative proposal presented on interest-free loans, deferred payroll 
tax deadline, early payment of tax credit and ceiling on deposits in the tax account 
----
ORIGINAL (Danish): Nyt fra CFOdirect: Shareholder-spørgsmål til ledelsen, SEC 
cybersikkerhedsguide, den amerikanske skattereform og meget mere

TRANSLATION: New from CFOdirect: Shareholder questions for management, the SEC 
cybersecurity guide, US tax reform and more 
----
END OF RESULTS

In the preceding example, the English acronym “PP&E” stands for “property, plant, and equipment,” and our model was able to connect it to our query.

In this case, all returned results are in Danish, but the model can return a document in a language other than the query if its semantic meaning is closer. We have complete flexibility, and with a few lines of code, we can specify whether the model should only look at documents in the language of the query, or whether it should look at all documents.
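One possible way to restrict results to a single language is to request more neighbors than needed and then keep only the matching ones. The sketch below shows that post-filtering step on toy data. The function name and the toy lists are hypothetical, and this is not necessarily the approach the authors had in mind.

```python
def filter_by_language(doc_ids, docs_lang, language, k):
    """Keep the first k ranked ids whose document language matches `language`."""
    kept = [i for i in doc_ids if docs_lang[i] == language]
    return kept[:k]

# Toy stand-ins: the language of each document, and ids ranked by similarity
toy_docs_lang = ["English", "Danish", "Spanish", "Danish", "English", "Danish"]
ranked_ids = [2, 1, 4, 3, 0, 5]

print(filter_by_language(ranked_ids, toy_docs_lang, "Danish", k=2))
print(filter_by_language(ranked_ids, toy_docs_lang, "English", k=2))
```

In the walkthrough, `ranked_ids` would correspond to the ids returned by `index.knn_query` with a larger `k`, and `toy_docs_lang` to the `docs_lang` list built earlier.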

Improve results with Cohere Rerank

Embeddings are very powerful. However, we’re now going to look at how to refine our results even further with Cohere’s Rerank endpoint, which has been trained to score the relevancy of documents against a query.

Another advantage of Rerank is that it can work on top of a legacy keyword search engine. You don’t have to change to a vector database or make drastic changes to your infrastructure, and it only takes a few lines of code. Rerank is available in Amazon SageMaker.

Let’s try a new query, which we later pass to Rerank on SageMaker:

query = "Are companies ready for the next down market?"
retrieved_docs, translated_retrieved_docs = retrieval(query)

QUERY: ARE COMPANIES READY FOR THE NEXT DOWN MARKET? 

ORIGINAL (Spanish): El valor en bolsa de las 100 mayores empresas cotizadas cae un 15% 
entre enero y marzo pero aguanta el embate del COVID-19 

TRANSLATION: The stock market value of the 100 largest listed companies falls 15% 
between January and March but withstands the onslaught of COVID-19 
----
ORIGINAL (English): 69% of business leaders have experienced a corporate crisis in the 
last five years yet 29% of companies have no staff dedicated to crisis preparedness
----
ORIGINAL (English): As work sites slowly start to reopen, CFOs are concerned about the 
global economy and a potential new COVID-19 wave - PwC survey
----
END OF RESULTS

In this case, a semantic search was able to retrieve our answer and display it in the results, but it’s not at the top. However, when we pass the query again to our Rerank endpoint with the list of docs retrieved, Rerank is able to surface the most relevant document at the top.

First, we create the client and the Rerank endpoint:

# map model package arn
import boto3
cohere_package = "cohere-rerank-multilingual-v2--8b26a507962f3adb98ea9ac44cb70be1" # replace this with your info

model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}

region = boto3.Session().region_name
if region not in model_package_map.keys():
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

# Note: this rebinds `co`, so retrieval() no longer uses the Bedrock client created earlier
co = cohere_aws.Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-rerank-multilingual", instance_type="ml.g4dn.xlarge", n_instances=1)

When we pass the documents to Rerank, the model is able to pick the most relevant one accurately:

results = co.rerank(query=query, documents=retrieved_docs, top_n=1)

for hit in results:
    print(hit.document['text'])

69% of business leaders have experienced a corporate crisis in the last five years yet 
29% of companies have no staff dedicated to crisis preparedness

Conclusion

This post presented a walkthrough of using Cohere’s multilingual embedding model in Amazon Bedrock in the financial services domain. In particular, we demonstrated an example of a multilingual financial articles search application. We saw how the embedding model enables efficient and accurate discovery of information, thereby boosting the productivity and output quality of an analyst.

Cohere’s multilingual embedding model supports over 100 languages. It removes the complexity of building applications that require working with a corpus of documents in different languages. The Cohere Embed model is trained to deliver results in real-world applications. It handles noisy data as inputs, adapts to complex RAG systems, and delivers cost-efficiency from its compression-aware training method.

Start building with Cohere’s multilingual embedding model in Amazon Bedrock today.

About the Authors

James Yi is a Senior AI/ML Partner Solutions Architect in the Technology Partners COE Tech team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy, and scale AI/ML applications to derive business value. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Gonzalo Betegon is a Solutions Architect at Cohere, a provider of cutting-edge natural language processing technology. He helps organizations address their business needs through the deployment of large language models.

Meor Amer is a Developer Advocate at Cohere, a provider of cutting-edge natural language processing (NLP) technology. He helps developers build cutting-edge applications with Cohere’s Large Language Models (LLMs).