Enhancing Fuzzy Name Search with Generative AI: A Hybrid Approach Using TheFuzz and Sentence-Transformers — Amazing Python Package Showcase (9)

Enhance your fuzzy name search accuracy and flexibility by combining traditional methods with Generative AI. By following my step-by-step implementation, you can create a robust name search solution that balances efficiency with advanced AI-driven accuracy, making it easier to handle name variations and improve search outcomes.
Enhancing Fuzzy Name Search with Generative AI by Python — AI Art generated by DALL-E

In many industries, accurately matching names within large datasets is crucial. Whether it’s in healthcare, customer service, or financial services, finding the right records quickly and accurately can make a significant difference. However, name variations, typos, and cultural differences can pose challenges to traditional name matching methods.

This is where fuzzy name search comes into play. Fuzzy name search allows you to find approximate matches rather than exact ones, making your search process more flexible and robust. In this article, we’ll explore how to enhance fuzzy name search by combining traditional methods with Generative AI. Specifically, we’ll use a hybrid approach that leverages TheFuzz, a library for character-based fuzzy matching, and Sentence-Transformers, which uses advanced AI models to understand the semantic meaning of names.

Why Fuzzy Name Search Matters

Fuzzy name search is a critical feature in various fields, particularly when dealing with large datasets or when accuracy in identifying individuals or entities is paramount. Here’s why it’s important:

Critical in Certain Industries

  • Legal and Law Enforcement: Matching names in legal documents, criminal records, or during background checks often requires fuzzy search to account for variations in how names are recorded.
  • Healthcare: Accurate patient identification is crucial to avoid medical errors. Fuzzy name search ensures that the correct patient records are retrieved, even if the name is not entered perfectly.
  • Finance and Banking: For compliance and fraud detection, fuzzy name search is used to match names against watchlists or blacklists, where names might be misspelled or altered deliberately.

Data Quality Issues

  • Human Error: Data entry errors such as typos, misspellings, or inconsistent formatting are common. Fuzzy name search helps in identifying the correct name even when there are slight discrepancies.
  • Inconsistent Naming Conventions: Names can be recorded differently due to variations in spelling, use of abbreviations, or different cultural conventions (e.g., “John Doe” vs. “Doe, John”). Fuzzy search helps to match these variations.
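To make the "Doe, John" vs. "John Doe" case concrete, a small normalization step can fold the comma convention into a single form before any fuzzy comparison. This helper is an illustrative sketch, not part of the libraries used later in this article:

```python
def normalize_name(name: str) -> str:
    """Rewrite "Last, First" as "First Last" and collapse extra whitespace."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first.strip()} {last.strip()}"
    return " ".join(name.split())

print(normalize_name("Doe, John"))   # -> John Doe
print(normalize_name("John   Doe"))  # -> John Doe
```

Running both record variants through a normalizer like this before matching removes one whole class of false mismatches up front.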

Improved Search Accuracy

  • Enhanced Matching Capabilities: Fuzzy search algorithms use techniques like Levenshtein distance, Soundex, or Jaro-Winkler to find names that are similar but not exact matches. This improves the chances of finding the right individual or record, even if the name is not spelled exactly as expected.
  • Handling of Ambiguities: Names can be ambiguous (e.g., common names like “Smith” or “Brown”). Fuzzy search allows for better disambiguation by considering similar or related names.
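To see what "Levenshtein distance" means in practice, here is a minimal pure-Python implementation of the edit-count metric mentioned above. Production code would normally use an optimized library such as rapidfuzz (which TheFuzz delegates to), but the sketch shows the idea:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("Jonson", "Johnson"))  # -> 1 (one inserted 'h')
```

A distance of 1 between "Jonson" and "Johnson" is why fuzzy search can surface records that an exact-match query would miss.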

Enhanced User Experience

  • Search Functionality: For applications with user search capabilities, offering fuzzy name search improves user satisfaction by returning relevant results even when the input is not perfect.
  • Reduced Frustration: Users often misspell names or enter them in a different format than stored. Fuzzy search helps in minimizing frustration by providing accurate results.

Inclusivity

  • Cultural Sensitivity: Names vary widely across cultures, and fuzzy search helps in accommodating these differences, ensuring that individuals are correctly identified regardless of cultural naming conventions.

Enhancing Fuzzy Name Search with Generative AI

Generative AI can significantly enhance fuzzy name search by leveraging its deep understanding of language, context, and semantics. Here’s how Generative AI, particularly through advanced models like Large Language Models (LLMs), can outperform traditional methods in fuzzy name search:

Understanding Context and Semantics

  • Contextual Awareness: Generative AI models, such as GPT-4 or those used in Sentence-Transformers, are trained on vast amounts of data that include various contexts and languages. This enables them to understand not just the names themselves but also the context in which they are used. For example, a generative model can distinguish between “John Smith” in a medical record and “John Smith” in a legal document based on surrounding text.
  • Semantic Matching: Traditional methods like TheFuzz rely on character-based comparisons, which may miss deeper semantic connections between names. Generative AI, on the other hand, can recognize that “Johnathan” and “Johnny” might refer to the same person, even though the character sequences are quite different.
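The gap is easy to demonstrate with the standard library's difflib, which (like TheFuzz) scores character overlap: spelling variants of the same name score high, while a name and its nickname score much lower even though they refer to the same person.

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Spelling variants of the same name share most of their characters...
print(char_similarity("Katherine", "Catherine"))  # high (~0.89)
# ...but a nickname shares far fewer, so character matching alone misses it.
print(char_similarity("Johnathan", "Johnny"))     # noticeably lower
```

A semantic model can close this gap because nicknames and their full forms appear together in its training data, even though the character sequences diverge.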

Handling Complex Variations

  • Phonetic Matching: Generative AI can better handle phonetic variations, recognizing that “Catherine” and “Katherine” are phonetically similar and likely the same name. It can also account for regional accents and dialects, which may affect name pronunciation and spelling.
  • Nicknames and Aliases: Generative AI can understand common nicknames, abbreviations, and aliases that traditional fuzzy matching might miss. For example, it can relate “Bob” to “Robert” or “Liz” to “Elizabeth,” providing a more comprehensive search capability.
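For the phonetic side, the classic Soundex algorithm illustrates both the promise and the limits of rule-based phonetic matching. The sketch below is a simplified Soundex (it does not implement the full h/w separator rules): it maps "Catherine" and "Cathryn" to the same code, but because Soundex always keeps the first letter, "Catherine" and "Katherine" still get different codes — exactly the kind of gap an embedding model can close.

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus up to three digit codes."""
    digits = {}
    for letters, d in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            digits[ch] = d
    name = name.lower()
    code, prev = [], digits.get(name[0], "")
    for ch in name[1:]:
        d = digits.get(ch)
        if d is None:
            if ch not in "hw":   # vowels break runs of the same code; h/w do not
                prev = ""
            continue
        if d != prev:            # collapse adjacent duplicates
            code.append(d)
        prev = d
    return (name[0].upper() + "".join(code) + "000")[:4]

print(soundex("Catherine"), soundex("Cathryn"))  # -> C365 C365 (phonetic match)
print(soundex("Katherine"))                      # -> K365 (first letter differs)
```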

Multilingual and Cross-Cultural Name Matching

  • Multilingual Capabilities: Generative AI models trained on multilingual datasets can match names across different languages and scripts. For instance, they can connect “محمد” (Muhammad) in Arabic script with “Mohammed” in Latin script.
  • Cultural Sensitivity: These models are also aware of cultural naming conventions and can adjust their matching algorithms accordingly, making them more effective in global applications.

Adaptive and Continuous Learning

  • Continuous Improvement: Generative AI models can be fine-tuned on specific datasets or updated with new data, allowing them to continuously improve their matching accuracy. This adaptive capability makes them more robust over time compared to static algorithms like Levenshtein distance used in traditional fuzzy matching.
  • Customizable Sensitivity: The sensitivity of matching can be dynamically adjusted based on the context or application. For example, a higher sensitivity might be needed in legal documents, while a broader matching scope could be useful in social media data.

Integrating with Traditional Methods

  • Hybrid Approaches: Generative AI can be integrated with traditional fuzzy matching methods to create hybrid approaches. For example, an AI model could first filter potential matches based on semantic similarity, and then a traditional method like TheFuzz could rank these matches based on character-level similarity.
  • Error Correction: AI can suggest corrections for misspelled names before applying traditional fuzzy matching techniques, thus improving overall accuracy.

Example Implementation

This example is run in Windows Subsystem for Linux (WSL) 2.

Create a Virtual Environment

Create a virtual environment in your project directory using virtualenv:

dynotes@P2021:~/projects/python/fuzzy$ virtualenv fuzzy

created virtual environment CPython3.10.12.final.0-64 in 1468ms
creator CPython3Posix(dest=~/projects/python/fuzzy/fuzzy, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=~/.local/share/virtualenv)
added seed packages: pip==24.1.2, setuptools==71.1.0, wheel==0.43.0
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

dynotes@P2021:~/projects/python/fuzzy$ source ./fuzzy/bin/activate
(fuzzy) dynotes@P2021:~/projects/python/fuzzy$

Installing Required Packages

Before we dive into the code, make sure you have Python installed on your system. We’ll also need to install the required Python packages. Open your terminal or command prompt and run the following command:

(fuzzy) dynotes@P2021:~/projects/python/fuzzy$ pip install thefuzz[speedup] sentence-transformers

This command installs:

  • TheFuzz: A library for character-based fuzzy matching.
  • Sentence-Transformers: A library that leverages Generative AI for semantic text matching.

Collecting sentence-transformers
Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting thefuzz[speedup]
Downloading thefuzz-0.22.1-py3-none-any.whl.metadata (3.9 kB)
Collecting rapidfuzz<4.0.0,>=3.0.0 (from thefuzz[speedup])
Downloading rapidfuzz-3.9.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting transformers<5.0.0,>=4.34.0 (from sentence-transformers)
Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
...
Successfully installed MarkupSafe-2.1.5 Pillow-10.4.0 certifi-2024.7.4 charset-normalizer-3.3.2 filelock-3.15.4 fsspec-2024.6.1 huggingface-hub-0.24.6 idna-3.8 jinja2-3.1.4 joblib-1.4.2 mpmath-1.3.0 networkx-3.3 numpy-2.1.0 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.6.20 nvidia-nvtx-cu12-12.1.105 packaging-24.1 pyyaml-6.0.2 rapidfuzz-3.9.6 regex-2024.7.24 requests-2.32.3 safetensors-0.4.4 scikit-learn-1.5.1 scipy-1.14.1 sentence-transformers-3.0.1 sympy-1.13.2 thefuzz-0.22.1 threadpoolctl-3.5.0 tokenizers-0.19.1 torch-2.4.0 tqdm-4.66.5 transformers-4.44.2 triton-3.0.0 typing-extensions-4.12.2 urllib3-2.2.2

Setting Up the Names Database and Query

First, we’ll create a new Python file, fuzzy.py, and define a sample database of names along with the query name we want to search for.

# Sample Names Database
names_database = [
    "Katherine Johnson",
    "Catherine Johnson",
    "Kathryn Jonson",
    "Kathy Johnson",
    "Johnathan Doe",
    "Katlyn Johns",
    "Jonathon Dow",
    "Katie Johnsen"
]

# User Input
query_name = "Kathryn Jonson"

In this example, “Kathryn Jonson” is the name we want to find in the database. Note that there are several similar names in the database with slight variations in spelling.

Implementing Semantic Matching with Sentence-Transformers

Sentence-Transformers uses transformer-based models to generate dense vector embeddings that capture the semantic meaning of names. This allows us to compare names based on their context and meaning, rather than just their characters.

from sentence_transformers import SentenceTransformer, util
import torch

# Initialize the Sentence-Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for the names in the database and the query name
name_embeddings = model.encode(names_database, convert_to_tensor=True)
query_embedding = model.encode(query_name, convert_to_tensor=True)

# Calculate the cosine similarity between the query and the database names
cosine_scores = util.pytorch_cos_sim(query_embedding, name_embeddings)

# Sort the results based on similarity scores
sorted_scores, indices = torch.sort(cosine_scores[0], descending=True)

# Display the top results based on semantic similarity
print("Semantic Similarity Results (using Sentence-Transformers):")
semantic_results = []
for idx in indices:
    name = names_database[idx]
    score = cosine_scores[0][idx].item()
    semantic_results.append((name, score))
    print(f"{name}: {score:.4f}")

Run it, and we will get the following results. Note that before calculating the semantic similarity scores, Sentence-Transformers downloads the pre-trained model from the Hugging Face Hub; this is a one-time operation as long as you keep using the same model.

(fuzzy) dynotes@WIN-P2021:~/projects/python/fuzzy$ python fuzzy.py

modules.json: 100%|███████████████████████████████| 349/349 [00:00<00:00, 1.19MB/s]
config_sentence_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 661kB/s]
README.md: 100%|██████████████████████████████████| 10.7k/10.7k [00:00<00:00, 23.5MB/s]
sentence_bert_config.json: 100%|██████████████████| 53.0/53.0 [00:00<00:00, 169kB/s]
config.json: 100%|████████████████████████████████| 612/612 [00:00<00:00, 2.11MB/s]
model.safetensors: 100%|██████████████████████████| 90.9M/90.9M [00:08<00:00, 10.3MB/s]
tokenizer_config.json: 100%|██████████████████████| 350/350 [00:00<00:00, 963kB/s]
vocab.txt: 100%|██████████████████████████████████| 232k/232k [00:00<00:00, 3.68MB/s]
tokenizer.json: 100%|█████████████████████████████| 466k/466k [00:00<00:00, 5.86MB/s]
special_tokens_map.json: 100%|████████████████████| 112/112 [00:00<00:00, 622kB/s]
~/projects/python/fuzzy/fuzzy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
1_Pooling/config.json: 100%|██████████████████████| 190/190 [00:00<00:00, 644kB/s]
Semantic Similarity Results (using Sentence-Transformers):
Kathryn Jonson: 1.0000
Katherine Johnson: 0.6201
Catherine Johnson: 0.6012
Kathy Johnson: 0.5486
Katlyn Johns: 0.5289
Katie Johnsen: 0.5251
Johnathan Doe: 0.3640
Jonathon Dow: 0.3471

(fuzzy) dynotes@P2021:~/projects/python/fuzzy$

Refining Results with Traditional Fuzzy Matching

While the semantic similarity scores give us a good sense of context, we can further refine these results by combining them with traditional fuzzy matching scores from TheFuzz.

Add the following code to fuzzy.py:

from thefuzz import fuzz

# Refine top results using traditional fuzzy matching and combine scores
refined_results = []
for name, semantic_score in semantic_results:
    fuzz_score = fuzz.ratio(query_name, name)
    combined_score = (semantic_score + fuzz_score / 100) / 2  # Averaging semantic and fuzzy scores
    refined_results.append((name, combined_score, fuzz_score))

# Sort the final refined results by the combined score and display
refined_results.sort(key=lambda x: x[1], reverse=True)

print("\nFinal Ranked Results (combining Semantic Similarity with TheFuzz):")
for name, combined_score, fuzz_score in refined_results:
    print(f"{name}: Combined Score = {combined_score:.4f}, Fuzz Score = {fuzz_score}")

Expected Output

When you run python fuzzy.py, you should see output similar to this:

Semantic Similarity Results (using Sentence-Transformers):
Kathryn Jonson: 1.0000
Katherine Johnson: 0.6201
Catherine Johnson: 0.6012
Kathy Johnson: 0.5486
Katlyn Johns: 0.5289
Katie Johnsen: 0.5251
Johnathan Doe: 0.3640
Jonathon Dow: 0.3471

Final Ranked Results (combining Semantic Similarity with TheFuzz):
Kathryn Jonson: Combined Score = 1.0000, Fuzz Score = 100
Katherine Johnson: Combined Score = 0.7301, Fuzz Score = 84
Kathy Johnson: Combined Score = 0.7193, Fuzz Score = 89
Catherine Johnson: Combined Score = 0.6856, Fuzz Score = 77
Katlyn Johns: Combined Score = 0.6495, Fuzz Score = 77
Katie Johnsen: Combined Score = 0.5975, Fuzz Score = 67
Jonathon Dow: Combined Score = 0.4035, Fuzz Score = 46
Johnathan Doe: Combined Score = 0.4020, Fuzz Score = 44
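The script averages the two scores with equal weight. If one signal should dominate — say, semantic similarity for multilingual data, or character similarity for strict compliance matching — a weighted combination is a natural extension. The weight parameter below is a hypothetical tuning knob of this sketch, not something either library provides:

```python
def combined_score(semantic: float, fuzz: int, weight: float = 0.5) -> float:
    """Blend a semantic score in [0, 1] with a TheFuzz score in [0, 100].

    weight is the share given to the semantic score; weight=0.5 reproduces
    the plain average used in fuzzy.py.
    """
    return weight * semantic + (1.0 - weight) * (fuzz / 100)

# With equal weights this matches the averaging used above:
print(combined_score(0.6201, 84))
# Leaning on character similarity instead (semantic counts for only 30%):
print(combined_score(0.6201, 84, weight=0.3))
```

In practice, the right weight depends on your data: start at 0.5 and adjust against a labeled sample of known matches and non-matches.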

Full Source Code

# Install required packages if not already installed
# pip install thefuzz[speedup] sentence-transformers

from sentence_transformers import SentenceTransformer, util
from thefuzz import fuzz
import torch

# Sample Names Database and Query Name
names_database = [
    "Katherine Johnson",
    "Catherine Johnson",
    "Kathryn Jonson",
    "Kathy Johnson",
    "Johnathan Doe",
    "Katlyn Johns",
    "Jonathon Dow",
    "Katie Johnsen"
]

query_name = "Kathryn Jonson"

# Step 1: Semantic Matching with Sentence-Transformers
model = SentenceTransformer('all-MiniLM-L6-v2')
name_embeddings = model.encode(names_database, convert_to_tensor=True)
query_embedding = model.encode(query_name, convert_to_tensor=True)
cosine_scores = util.pytorch_cos_sim(query_embedding, name_embeddings)
sorted_scores, indices = torch.sort(cosine_scores[0], descending=True)

print("Semantic Similarity Results (using Sentence-Transformers):")
semantic_results = []
#top_n = 5
#for idx in indices[:top_n]:
for idx in indices:
    name = names_database[idx]
    score = cosine_scores[0][idx].item()
    semantic_results.append((name, score))
    print(f"{name}: {score:.4f}")

# Step 2: Refining with TheFuzz
refined_results = []
for name, semantic_score in semantic_results:
    fuzz_score = fuzz.ratio(query_name, name)
    combined_score = (semantic_score + fuzz_score / 100) / 2
    refined_results.append((name, combined_score, fuzz_score))

refined_results.sort(key=lambda x: x[1], reverse=True)

print("\nFinal Ranked Results (combining Semantic Similarity with TheFuzz):")
for name, combined_score, fuzz_score in refined_results:
    print(f"{name}: Combined Score = {combined_score:.4f}, Fuzz Score = {fuzz_score}")

Summary

By combining the strengths of both semantic matching (using Sentence-Transformers) and traditional fuzzy matching (using TheFuzz), we can significantly enhance the accuracy and flexibility of fuzzy name search. This hybrid approach leverages the contextual understanding of Generative AI while also accounting for character-level similarities, providing a comprehensive solution to name matching challenges.

Whether you’re working with law enforcement databases, healthcare records, or any other domain where name matching is critical, this method offers a powerful tool for improving search accuracy. With this guide, you now have a step-by-step blueprint for implementing this enhanced fuzzy name search solution in your own projects.
