Semantic Search with Pinecone and Huggingface
Introduction
In today's data-driven world, the ability to swiftly and accurately mine vast swathes of information is crucial. Whether it's a legal firm sifting through thousands of case files to find precedents, a research institution analyzing academic papers for a specific keyword, or a corporation scanning its archives for strategic insights, semantic search can be a game-changer. Imagine not just looking for exact phrases but having the capability to understand and retrieve data based on the meaning of a query.
Recently, I decided to venture into the world of Pinecone, a SaaS vector database. While Pinecone provides a wide variety of examples, I noticed a slight gap in their semantic search guide, particularly regarding data transformation, storing, and subsequent retrieval using Huggingface.
Why Should You Care?
Semantic search isn't just a fancy tech term; it's a revolutionary approach to information retrieval. Traditional search methods often require exact matches, making them less flexible and potentially overlooking critical documents. In contrast, semantic search understands the nuances and contexts, ensuring that:
Legal Professionals can find relevant case files, even if the exact terms aren't used.
Researchers can identify valuable literature and data, even if buried deep within a document.
Businesses can tap into insights and trends by understanding the sentiment and context of their archived documents.
Government agencies can ensure they're always compliant, retrieving documents that pertain to specific regulations without manually trawling through piles of paperwork.
By harnessing the power of Pinecone and Huggingface, semantic search can be made more accessible, powerful, and user-friendly.
Diving Deeper
In this blog, I'll take a comprehensive look at the process outlined in the semantic-search-example. By the time you're done reading, you'll have insights into constructing a Pinecone index, primed to query and retrieve related content from a document dataset, right down to the specific document's filename. But before we jump in, let’s go over some important terminologies.
Terminology
Vector Database: Think of this as a specialized database designed to handle vector data efficiently. In layman's terms, it's like a magical vault where vectors - lists of numbers representing data - are stored, and you can easily find the most similar vectors to a new one you provide.
Vector Embeddings: This is just a fancy name for converting something (like a word or document) into a list of numbers, so computers can "understand" it better.
Semantic Search: Imagine typing a query and it not just searching for those exact words, but truly understanding the meaning behind your search, giving you more relevant results.
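To ground the idea of "finding the most similar vectors", here is a minimal, dependency-free sketch of cosine similarity, the default distance metric for Pinecone indexes:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two toy 3-dimensional "document embeddings"; the second is the first scaled
# by 2, so they point in the same direction and the similarity is ~1.0.
print(cosine_similarity([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]))
```

Real embeddings have hundreds of dimensions rather than three, but the comparison works exactly the same way: the closer the score is to 1.0, the more similar the meanings.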
The Dataset
For our demonstration, we need a rich corpus of PDF files. The US Library of Congress graciously offers a bundle of 1,000 US government PDFs that fits our purpose perfectly. In our generateDataset.py script, we employ the PyPDF2 library to extract the text and curate a dictionary.
With the text extracted, we use the sentence transformer all-MiniLM-L6-v2 to generate the vector representation of our text and save it as a zip file to disk. This zip contains some metadata, as well as the *.arrow file which can be used by Huggingface datasets.
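The extraction and embedding step can be sketched roughly as follows. The function and column names here are my own for illustration; the actual logic lives in generateDataset.py:

```python
# Sketch of extracting PDF text and embedding it with sentence-transformers.
# Heavy imports are done lazily inside the functions so the pure helpers
# remain importable on their own.
from pathlib import Path

def clean_text(raw: str) -> str:
    """Collapse the whitespace runs that PDF extraction tends to leave behind."""
    return " ".join(raw.split())

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in a PDF using PyPDF2."""
    from PyPDF2 import PdfReader
    reader = PdfReader(pdf_path)
    return clean_text(" ".join(page.extract_text() or "" for page in reader.pages))

def build_rows(pdf_dir: str) -> dict:
    """Build a column-oriented dict that datasets.Dataset.from_dict can consume."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim embeddings
    rows = {"pdf_file": [], "text": [], "embedding": []}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        text = extract_text(str(pdf))
        rows["pdf_file"].append(pdf.name)
        rows["text"].append(text)
        rows["embedding"].append(model.encode(text).tolist())
    return rows
```

Keeping the filename alongside the text in the same row is what later lets us report which PDF a match came from.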
Uploading to Huggingface (optional)
If desired, you can upload a local dataset to Huggingface. You will need to have an account and a repository for this data, and before executing the code you should run
huggingface-cli login
to set the auth tokens to your environment.
This code is covered in uploadDatasetToHuggingface.py.
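The upload itself boils down to loading the dataset from disk and calling push_to_hub. This is a sketch with a placeholder repository id; see uploadDatasetToHuggingface.py for the real version:

```python
# Sketch of pushing a local dataset to the Huggingface Hub. Assumes a prior
# `huggingface-cli login` so an auth token is available in the environment.
def upload(dataset_path: str, repo_id: str) -> None:
    """Load a saved dataset directory from disk and push it to the Hub."""
    from datasets import load_from_disk  # lazy import keeps this sketch importable
    dataset = load_from_disk(dataset_path)
    dataset.push_to_hub(repo_id)

# upload("data/us-gov-pdfs", "your-username/us-gov-pdfs")  # placeholder repo id
```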
Pinecone Index
Brace yourselves, for this step was a tad challenging. Our mission? To craft a Pinecone index (if one doesn't already exist) and upsert our dataset. The original Pinecone example runs very smoothly, but only because it uses a pinecone_dataset object with proper formatting and convenience functions. For a while I found myself treading murky waters, as Pinecone has a specific appetite for data formats; thankfully, everything is well documented in Pinecone's upsert guide.
In summary, I needed to format my data so it would look similar to this:
[
    {
        'id': 'vec1',
        'values': [0.1, 0.2, 0.3],
        'metadata': {'pdf_file': 'AAAA.pdf', 'text': 'DDD'},
        'sparse_values': {
            'indices': [10, 45, 16],
            'values': [0.5, 0.5, 0.2]
        }
    },
    {
        'id': 'vec2',
        'values': [0.2, 0.3, 0.4],
        'metadata': {'pdf_file': 'BBB.pdf', 'text': 'CCC'},
        'sparse_values': {
            'indices': [15, 40, 11],
            'values': [0.4, 0.5, 0.2]
        }
    }
]
A function to convert to this format, along with other conveniences, can be found in utils.py. Also make sure to create a settings.py file following the provided example, and enter your Pinecone API key.
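The reshaping itself can be sketched as below. This is my own simplified version handling dense vectors only (no sparse_values); the helper names are illustrative and the real versions live in utils.py:

```python
# Reshape dataset rows into the {'id', 'values', 'metadata'} dicts that
# Pinecone's upsert call expects.
def to_pinecone_vectors(rows: list[dict]) -> list[dict]:
    """Map rows with 'embedding', 'pdf_file' and 'text' keys to Pinecone format."""
    return [
        {
            "id": str(i),
            "values": row["embedding"],
            "metadata": {"pdf_file": row["pdf_file"], "text": row["text"]},
        }
        for i, row in enumerate(rows)
    ]

def upsert_in_batches(index, vectors: list[dict], batch_size: int = 100) -> None:
    """Upsert in chunks to stay under Pinecone's per-request size limits."""
    for start in range(0, len(vectors), batch_size):
        index.upsert(vectors=vectors[start:start + batch_size])
```

Batching matters here: with 1,000 documents, pushing everything in one request would exceed Pinecone's request size limit.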
Semantic Query
Our setup is now ripe for querying, as shown in queryExample.py. For instance, when our query is "Foreign trade sanctions", the ensuing result is:
0.4 Match - id: 359, pdf_file: STG36KDTBPEJTVEWCY734EISHQ5ZUVWF.pdf, text summary: Summarize: Table 1267. U.S. Exports and Imports for Consumption of Merchandise. By Customs District: 2000 to 2008. In billions of dollars (780.0 represents $780,000,000,.000).
0.39 Match - id: 246, pdf_file: DQSGXNBZHHZWC6653JA34RT56TTUZ6W5.pdf, text summary: summarize: Summary US vs. Foreign by Program and Vessel Type for 11/20/2007 12:16:51 PM. 10/1/2002 9/30/2003 throughBULK/TUG/BARGEProg. Total Metric TonsUS Metric TonsForeign Metric TONS %US %FR /Country Total OFR US OFR Foreign OFR %US%FR / country Total of the US Navy’s ships and submarines.
0.38 Match - id: 276, pdf_file: GOCYUTJWJZBXHIUSFYG3LVX5SG6HPEVV.pdf, text summary: USTDA's program inChina focuses onadvancing U.S. trade and commercial interests in transportation, energy, agriculture, and healthcare sectors. USTDA conducted 9 reverse trade missions to introduce Chinese officials to U.s. best practices.
0.37 Match - id: 286, pdf_file: WYIDAT7X2DXYMKGRFAXDEAA2EJF23GRQ.pdf, text summary: The Credit Union National Association (CUNA) appreciates the opportunity to criticize the proposed rule. CUNA represents more than 90 percent of our nation's 10,500 state and federal credit unions. We have always been concerned that the ever-increasing complexity of the OFAC sanctions programs raises the risk that entities may mistakenlymistakenly violate the requirment.
0.36 Match - id: 874, pdf_file: YVM7OB53IEXWGJUUIKINFKKEDK23PC5K.pdf, text summary: Report setting forth in full the cir-centriccumstances relating to such transfer promptly upon discovery that: Such transfer was in violation of provisions of this part or any regu-lation, ruling, instruction, direction, or license issued pursuant to this part; or such transfer was not licensed or authorized by the Director of the Office of Foreign Assets Control; or if a license did purport to cover the transfer, such license had been ob-tained by misrepresentation of a third party or withholding of material facts.
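The query step itself is short: embed the query with the same model used at indexing time, then ask Pinecone for the nearest neighbors. This is a sketch with my own function name; the real code is in queryExample.py:

```python
# Sketch of a semantic query against an existing Pinecone index.
def semantic_query(index, model, query: str, top_k: int = 5):
    """Embed the query and return the top_k nearest matches with their metadata."""
    query_vector = model.encode(query).tolist()
    result = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    for match in result.matches:
        meta = match.metadata
        print(f"{match.score:.2f} Match - id: {match.id}, "
              f"pdf_file: {meta['pdf_file']}, text summary: {meta['text'][:300]}")
    return result.matches
```

Because we stored the filename in each vector's metadata at upsert time, every match can be traced back to the exact PDF it came from.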
Future Improvements
A closer look at the results shows that they are all somewhat related to the original query, yet there is room for improvement. Preprocessing the input data, omitting unrelated boilerplate, and applying summarization before embedding could all improve the precision of the results.
Conclusion
Hopefully this exercise shed light on the simplicity and efficiency of setting up semantic search for your array of documents. This setup can definitely be applied in the real world by an organization that wants to make its internal documents easily searchable - just use a framework like Django, Flask or FastAPI to build a user-friendly frontend or API endpoint, and enjoy the glory of living in the future with semantic search!