Saturday 22 February 2014

Mathematical notation at a glance

Notation... If you have come across it, whether it's mathematical, musical or otherwise, perhaps you thought you had just stumbled into The Matrix or an episode of The Fringe; with the alphanumeric characters and a plethora of symbols, it can look like an alien language or some sort of ASCII art abomination.

While that might be the case, notation is used to convey ideas through symbols and makes life easier, especially when conveying complex ideas to a diverse audience. How?

See what you make of this:

Levenshtein algorithm notation from wikipedia article
Levenshtein Distance/Optimal String Alignment Algorithm
Source: http://en.wikipedia.org/wiki/Levenshtein_distance
Welcome to the real world of notation. I came across this algorithm a while ago where it was used in a problem involving two sets of data with no direct relationship apart from peoples names. A common problem with data such as peoples names is misspellings which means direct matches aren't guaranteed; that's where this algorithm can be put to use. It can also be used in computational biology, but that is beyond the scope of this post (do an internet search for 'Levenshtein biology' if you are intrigued about its use in biology).

The algorithm gives you the distance between two sequences; in other words, the cost involved in changing something into something else. I'm not going to delve into great detail about the Levenshtein algorithm in this post, just how to understand the notation. Feel free to make comments about the algorithm though if you'd like to know more.

So where do you begin with that algorithm? How do we turn that into something more than a notation? First step is to understand the symbols, much like understanding any language. Some elements of that algorithm you probably understand such as the -+ and = symbols, min and max. You will no doubt understand the words if and otherwise, but perhaps not how they are applied. We do need one extra piece of information which is: the Levenshtein distance between two strings a, b  is given by leva,b(|a||b|) which I will explain first.

So, firstly, a string is a sequence of characters which could be numbers, letters or symbols. a, b are used to represent two values, in this case, two strings. Lets say a represents the name 'Jon' and b represents 'Johnny'. So the Levenshtein distance between 'Jon' and 'Johnny' is given by leva,b(|a||b|). What is leva,b(|a||b|)? That is a function which takes two inputs |a| and |b| and outputs something. The vertical lines either side mean the magnitude or length of a and b. What are the lengths of the names 'Jon' and 'Johnny'? They are 3 and 6. We can now rewrite that function like so: lev'Jon','Johnny'(3, 6).

Let's have another look at that algorithm. You can see something that looks similar to leva,b(|a|, |b|) but it isn't quite the same; |a| and |b| are now i and j, but all that means is that i and j represent |a| and |b| in the same way that a and b represent 'Jon' and 'Johnny'. At this point, we should have a better idea about what certain elements of the algorithm mean, or at the very least, what they represent.

There are still curly braces and the if and otherwise words which we may not understand though. Quite simply, the curly braces group things together. Reading from left to right, the far left curly brace groups together the if min(i, j) = 0, max(i, j) and min. The inner curly brace groups together the three function references. The if and otherwise are conditions, so if the minimum value of i and j is equal to 0 then we need the maximum value of i and j which means leva,b(ij) or the Levenshtein distance is the maximum value of i and j which represent the lengths of a and b.

So that is the if condition handled, what about the otherwise? Well, if min(i, j) does not equal 0 then we need the minimum of that inner group with the three function references which means leva,b(ij) or the Levenshtein distance is the minimum value of that inner group.

The inner group is probably the most complicated part of this whole algorithm. Each of those three references refers to the Levenshtein algorithm but with different inputs for i and j. Not only that, but for each of those references, except the last one, we add 1 to whatever leva,b(ij) is equal to which I explained above. So what about this last bit: + 1(aibj)? It means we add 1 if the character at position i of sequence a does not equal the character at position j of the sequence b. So if a and b are 'Jon' and 'Johnny', and position i and j are equal to 3 (or 2 in zero-based arrays), then we take the third letter of each name and compare them. In this instance, the third letters are 'n' and 'h' which are not equal, so we would add 1. You might then ask, what do we add if the two characters are equal? We don't add anything, so it's 0.

Head spinning? You might be happy to know that that is where the glance at notation ends and you can throw questions my way if you would like to know more.

One final thing... I may have shown you how to read elements of mathematical notation, but I never showed you a way of turning it into something more that a bunch of symbols, something that can be used. It may not surprise you that the one way I would show you how it can be used is through computer programming; it's what I do. If you don't understand computer programming I would still encourage you to have a look at this computer code I wrote using the Python programming language and see if you can spot the similarities between the code and the original algorithm.

def LevDist(s, s_len, t, t_len):
 if min(s_len, t_len) == 0:
  return max(s_len, t_len)
 else:
  cost = 0
  if s[s_len-1] != t[t_len-1]:
   cost = 1
  return min(LevDist(s, s_len-1, t, t_len)+1,
                       min(LevDist(s, s_len, t, t_len-1)+1, 
                       LevDist(s, s_len-1, t, t_len-1)+cost))

For the programmers among us, I know this is a very simple and poorly performing solution. The reason it has such poor performance is because of the recursion. The time it takes to determine the distance between strings increases exponentially in relation to the length of the two strings, resulting in poor performance. I won't be discussing better implementations in this post, but if you want to have a look at some potential solutions in one of many programming languages then check out this link to rosettacode.org website.

I also found this resource from stanford.edu regarding edit distances/Levenshtein algorithm which is worth a read: Edit Distance Notes

No comments:

Post a Comment