A data scientist is often tasked with finding insights in huge swaths of data. In this blog post, I will discuss some ideas around document matching; "fuzzy matching" is another way of describing the same problem. Sampling won't help here: you are forced to match every single entity against a dictionary of names. The National ID could be a contender for helping us match, but it appears to be an optional field. Probabilistic or "fuzzy" matching allows us to match data in situations where an exact match on a shared key is not possible. Let's look at the edit distance algorithms available within Talend. In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. A common scenario for data scientists is that the marketing, operations, or business groups give you two sets of similar data with different variables.
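To make the Jaro–Winkler idea concrete, here is a minimal pure-Python sketch of the similarity score and of matching one name against a dictionary of names. It is an illustration only, not Talend's implementation: the function names `jaro`, `jaro_winkler`, and `best_match`, and the sample names, are hypothetical. The prefix scale of 0.1 and the prefix cap of 4 characters are the values commonly used for this metric.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and no farther
    # apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order,
    # counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def best_match(name: str, dictionary: list[str]) -> tuple[str, float]:
    """Hypothetical helper: return the dictionary entry most similar to name."""
    return max(((entry, jaro_winkler(name, entry)) for entry in dictionary),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # approximately 0.961
    print(best_match("Jon Smith", ["John Smith", "Jane Smyth", "Bob Stone"]))
```

The comparison is case-sensitive, so in practice you would probably normalize case and whitespace first. Scoring every entity against every dictionary entry is quadratic, which is why large matching jobs usually add a blocking step before computing similarities.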