Learning Curve

Sunday 12 October 2014

Getting the most out of SQL Server: CLR Integration, Part Two

In part one of this article I mentioned that I would be files and documentation for providing regular expression support on SQL Server and that time has now arrived.

Please note that my implementation is very simple, but that was intended. Credits go to Microsoft and MSDN for supplying extensive documentation on the classes and methods involved, especially for the Concatenate Aggregate function source which has nothing to do with regex but I put it there to demonstrate creation of a CLR Aggregate function. One way you could improve the Concatenate Aggregate function would be to add a parameter for specifying a field delimiter (currently a comma).

You can find below two links to the files you will need. If you are interested in the code behind the regex functions you will be creating on your database, please download the project files developed in Visual Studio 2010. If you are only interested in creating the regex functions on your database, please download the binaries.

First of all, understand that I am not giving in-depth instruction on how to write CLR functions in managed code, set up your code editor, compile your code and install it on your database; that is what the project files are for. If you are inquisitive, you may have installed my regex functions using the files I provided already, otherwise, follow these steps:

Open install.sql in SQL Server Management Studio
Select database from drop-down to install functions on
Press CTRL + L SHIFT + M in Query Window and enter template parameters
Execute script

Provided you had no errors, the functions will have been created. If you do have problems then comment on this post and I will do my best to help. Also, if you have any questions regarding the development of CLR functions then please feel free to comment on this post or contact me directly.

Something you should be aware of is that if you can install Master Data Services (only available to Enterprise or Business Intelligence users) then you can access regex functionality without writing your own custom CLR functions or installing my regex implementation. You can find details about that in the SQL 2008 R2 regular expressions post on Robert Dyballs blog.

Tuesday 19 August 2014

Getting the most out of SQL Server: CLR Integration, Part One

So this is the first post in a series I will be doing over the next few months or so that will highlight some aspects of SQL Server I have used in the past few years, why you should consider them and how you can use them. Obviously there have been thousands of articles written about the features I will be talking about, however, I plan to take an in-depth look at these features including scenarios and examples, presenting it in a Q & A format.

The first feature is CLR (common language runtime) integration for CLR programming in SQL Server which was first made available in SQL Server 2005.

What is CLR integration all about?

In a nutshell, CLR integration allows you to create stored procedures, triggers, data types, user-defined scalar, aggregate and table valued functions with any .NET programming language including C#, Visual C++ and Visual Basic.NET. This means you have access to the .NET framework and the power of managed code in SQL Server and your SQL queries.

Why has this feature been made available when we can create those objects using SQL?

SQL Server is an RDBMS which are the kings of data storage and SQL is the king of data querying. In other words, they are designed to do specific tasks optimally although they can perform other tasks, just not as well.

One task that SQL Server can perform but not optimally is string manipulation using functions such as PATINDEX, REPLACE, SUBSTRING, STUFF, and TRIM. Granted, string manipulation functions have been made available, but you are limited as to what you can do compared to the string manipulation functionality available in the .NET framework and .NET programming languages.

Take PATINDEX for example; you can build very simple pattern matching expressions using four wildcard operators, but identifying any complex patterns would involve nesting PATINDEX calls with other string functions available to SQL Server and possibly using loop statements to look at individual characters in a string. Compare that to regular expressions and perhaps you can begin to see the limitations. Also consider that depending on how you use these functions in your SQL statements, your SQL query may not take full advantage of the SQL Server engine since SQL operations work on the basis of set theory, that is to say, it operates on sets of data rather than individual pieces of data.

Why is CLR integration disabled by default, does that mean there are potential security implications?

CLR integration being disabled by default does not stop you or someone with enough privileges from installing CLR objects onto your database, it simply stops them being executed. After reading that you might think that when it is enabled it just opens up your database to malicious use by allowing unrestricted access to CLR objects that have been installed. Hold that thought as there are safeguards in place, two of which are CAS (Code Access Security) and HPA (Host Protection Attribute).

What are CAS and HPA and how do they protect my database when CLR is enabled?

CAS is a security model for managed code (it's not just specific to SQL Server) that determines what resources such as files, networks and the registry on the computer the code can access and what level of access is permitted. There are three places these permissions are defined: the machine, user and host. The first two are quite obvious; the last is the permission set chosen when creating CLR objects on SQL Server. There are three of these sets named SAFE, EXTERNAL_ACCESS and UNSAFE.

SAFE restricts the code from accessing any external resource such as files, the network, environment variables and the registry. In other words, you cannot access anything that is not part of SQL Server. It is the safest or most restrictive permission set as the name suggests.

EXTERNAL_ACCESS has the same permissions as SAFE but with the ability to specify access to resources outside of SQL Server, some of which I named above, but the full list can be found in the CLR Integration Code Access Security article on MSDN. Because you can allow or deny permissions to specific external resources your database is still protected, but there is obviously a little more room for malicious use, especially when allowing access to the internet.

UNSAFE is the least restrictive set as the name suggests and allows unrestricted access to local and external resources as well as the execution of native or unmanaged code. You could write code that performed operations on the SQL Server process if the permissions are set correctly.

Venn diagram depicting Code Access Security policies

Code Access Security Policies (exc. Enterprise)

CAS might seem a little scary if someone with enough privileges could allow CLR objects to execute
unrestricted code with the single keyword UNSAFE, but in reality it is much more secure than that. Even if the CLR objects have UNSAFE permissions, if the machine or user have permissions that deny access to a resource that UNSAFE permissions allows, then those permissions will be used instead. When you run a CLR object, all three sets of permissions are checked, so the final set of permissions can be considered the intersection of the three, that is, machine, user and host as seen in the Venn diagram.

Then you have HPA which is much more explicit than CAS in that it disallows specific bits of code, either types or methods that can be used by the CLR object when it is created using the SAFE or EXTERNAL_ACCESS permissions set. Some examples of disallowed code are the opening of message boxes or accessing peripheral devices. HPA is more than just a security feature as it also assists in the reliability of the system by preventing usage of code that may affect threads and processes which could result in memory leaks for example. A full list of disallowed types and methods can be found in the Host Protection Attributes and CLR Integration Programming article on MSDN.

It is worth noting at this point that as of .NET Framework 4/CLR 4.0 major changes have been made to the CAS model which includes the removal of the policy-based system. However, this particular change does not affect SQL Server users; SQL Server will continue to use the CAS model defined in CLR 2.0 which uses the policy-based system as mentioned in What's New in CLR integration article on MSDN.

In part two I will be providing an example along with files and documentation for enabling regular expressions through CLR objects on your SQL Server.

Stay tuned.

Tuesday 3 June 2014

Databases - more than a big box of stuff

First up, my apologies for the lack of blog posts. I've been hard at work on some new projects over the past few months that have taken up my time and which I'll be blogging about soon enough.

So, without further ado, lets begin.

"Database: a comprehensive collection of related data organized for convenient access [...]"

Source: Dictionary.com

Simply put, databases and data warehouses are systems for storing information which we have no doubt worked with or have heard of at some point in our lives. In fact, you may have used a database without even realising it. Dictionary.com defines the word database as: 'a comprehensive collection of related data organised for convenient access [...]'. Can you think of some examples that fit that definition?

Here are a few examples I came up with, most of which fit the definition loosely but can ultimately be defined as a database nonetheless:

Rolodex
Telephone book
Filing Cabinet
Mobile Phone Contacts List
Computer Hard-drives
Computer Software/Systems (MySQL, SQL Server, Oracle, sqlite, Access, Excel, etcetera)

These examples allow for convenient storage and access to information whether its by collections of alphabetically indexed cards and numbered pages or physical disk platters and digital tables. The key points to bear in mind regarding databases are that the data is organized and organized in such a way to allow for easy access.

The reason for this brief explanation is quite simply to show you that databases are not simply boxes of stuff you can just throw things into and pick out of at random; they don't look like my attic, disorganized and awkward to access.

That might seem obvious to most people, but it is a lesson that I learnt the harder way, partly because of my background in applications programming where data is manipulated in a different way to databases. My first attempts at database development and writing SQL showed this.

If you work with or intend to work with databases (digital or otherwise) and use them as I did, now is the time to consider changing that. So how can you do that?

Understand the basic concepts and principles of X

I cannot stress this enough as I have learnt this the hard way in some instances, that, without understanding the concepts and principles of whatever X is, we are likely going to be working against ourselves, relying on guesswork, trial and error and our limited or non-existent experience. It's like learning another language where some might learn stock words and phrases making their vocabulary very limited, while others learn how sentences are structured, how and why words mutate and how to correctly pronounce words so they can communicate with a certain degree of fluency.

This is true of databases, where, we might know about tables, views, functions and indexes and how to write SQL but we don't fully understand how or why something works. Obviously this limits us significantly in the effective use of databases in the same way a person learns a language by only remembering stock words and phrases.

So let's look at some basic database concepts, starting with one that I feel is not covered adequately enough (if at all) in Computer Science database modules and that is 'set theory'. Just to mention, the reason I feel it is not covered adequately enough is that I have worked with some CS graduates who took database modules but had very limited understanding of set theory (which I appreciate could be the student as much as it could be the teacher/module content) or weren't aware of it.

Set theory is a branch of mathematics that works with collections of objects which is precisely what a database is: a collection of information. So when we work with databases, understand that we are working with sets or collections of information, not simply individual records. Using the Telephone book as an example, when we turn to a page, we are looking at a collection of hundreds of phone numbers which we then need to filter to find the results we want, whether it is a single phone number or a subset of them. Sure, we might filter them by going line-by-line, but a database does not. That said, it is possible to go row-by-row in a database but it is not efficient or convenient and goes against the fundamental concepts of databases.

So that's the first concept, the second is that of 'relationships'. Going back to the dictionary definition, it uses the phrase 'related data', so what does that mean? A page in a Telephone book contains not just a list of telephone numbers but the person associated with that number. In other words, the person is related to that particular number and vice versa. In a database this is usually represented by tables and keys, so we might have a table with peoples names in which links to another table containing all the phone numbers using keys. Voila, we have related data.

Perhaps I have over-simplified it, but it'll suffice. Again, this concept of relational data has roots in mathematics, namely, relational algebra and tends to be formally referred to as the 'entity-relationship model' or the 'relational model'.

Third concept is 'indexing'. It is worth mentioning at this point that SQL (structured query language) is made up of separate languages, namely DCL (data control language), DDL (data definition language) and DML (data manipulation language). I won't be covering them in detail, but indexing falls under DDL and DML.

DDL is used for defining or describing data, part of which involves the creation of indexes which is as much as I will be saying about DDL for now. DML is used for manipulating the data and uses indexes to do so. Remember that a database allows for easy access which indexes are essential for. Going back to the Telephone book, imagine if we removed the alphabetical ordering, page numbers and categorization and then threw all the pages into a bag. Now you need to find the phone numbers for all J. Does. Not so easy now is it?

Put simply, indexing is a way of maintaining organized lists that reference locations of information. Like the Telephone book, most textbooks have an index at the front or back with titles or keywords linked to page numbers so we can find information easily. It's outside the scope of this post, but it is worth mentioning that databases actually use the concept of pages and indexes in a similar way to a textbook. If you'd like to find out more and can handle some technical jargon, then have a look at the SQL Server Planning and Architecture Articles over at Microsoft TechNet.

As a final note, not all databases are relational. If you have a few minutes to spare and are curious, check out non-relational databases which are sometimes referred to as NoSQL databases.

It goes without saying that I have barely scratched the surface of relational databases. I'll leave it there for now and save the rest for another day so stay tuned.

Saturday 22 February 2014

Mathematical notation at a glance

Notation... If you have come across it, whether it's mathematical, musical or otherwise, perhaps you thought you had just stumbled into The Matrix or an episode of The Fringe; with the alphanumeric characters and a plethora of symbols, it can look like an alien language or some sort of ASCII art abomination.

While that might be the case, notation is used to convey ideas through symbols and makes life easier, especially when conveying complex ideas to a diverse audience. How?

See what you make of this:

$Levenshtein algorithm notation from wikipedia article$

Levenshtein Distance/Optimal String Alignment Algorithm
Source: http://en.wikipedia.org/wiki/Levenshtein_distance

Welcome to the real world of notation. I came across this algorithm a while ago where it was used in a problem involving two sets of data with no direct relationship apart from peoples names. A common problem with data such as peoples names is misspellings which means direct matches aren't guaranteed; that's where this algorithm can be put to use. It can also be used in computational biology, but that is beyond the scope of this post (do an internet search for 'Levenshtein biology' if you are intrigued about its use in biology).

The algorithm gives you the distance between two sequences; in other words, the cost involved in changing something into something else. I'm not going to delve into great detail about the Levenshtein algorithm in this post, just how to understand the notation. Feel free to make comments about the algorithm though if you'd like to know more.

So where do you begin with that algorithm? How do we turn that into something more than a notation? First step is to understand the symbols, much like understanding any language. Some elements of that algorithm you probably understand such as the -, + and = symbols, min and max. You will no doubt understand the words if and otherwise, but perhaps not how they are applied. We do need one extra piece of information which is: the Levenshtein distance between two strings a, b is given by lev_a,b(|a|, |b|) which I will explain first.

So, firstly, a string is a sequence of characters which could be numbers, letters or symbols. a, b are used to represent two values, in this case, two strings. Lets say a represents the name 'Jon' and b represents 'Johnny'. So the Levenshtein distance between 'Jon' and 'Johnny' is given by lev_a,b(|a|, |b|). What is lev_a,b(|a|, |b|)? That is a function which takes two inputs |a| and |b| and outputs something. The vertical lines either side mean the magnitude or length of a and b. What are the lengths of the names 'Jon' and 'Johnny'? They are 3 and 6. We can now rewrite that function like so: lev_{'Jon','Johnny'}(3, 6).

Let's have another look at that algorithm. You can see something that looks similar to lev_a,b(|a|, |b|) but it isn't quite the same; |a| and |b| are now i and j, but all that means is that i and j represent |a| and |b| in the same way that a and b represent 'Jon' and 'Johnny'. At this point, we should have a better idea about what certain elements of the algorithm mean, or at the very least, what they represent.

There are still curly braces and the if and otherwise words which we may not understand though. Quite simply, the curly braces group things together. Reading from left to right, the far left curly brace groups together the if min(i, j) = 0, max(i, j) and min. The inner curly brace groups together the three function references. The if and otherwise are conditions, so if the minimum value of i and j is equal to 0 then we need the maximum value of i and j which means lev_a,b(i, j) or the Levenshtein distance is the maximum value of i and j which represent the lengths of a and b.

So that is the if condition handled, what about the otherwise? Well, if min(i, j) does not equal 0 then we need the minimum of that inner group with the three function references which means lev_a,b(i, j) or the Levenshtein distance is the minimum value of that inner group.

The inner group is probably the most complicated part of this whole algorithm. Each of those three references refers to the Levenshtein algorithm but with different inputs for i and j. Not only that, but for each of those references, except the last one, we add 1 to whatever lev_a,b(i, j) is equal to which I explained above. So what about this last bit: + 1_{(a_i}≠_{b_j})? It means we add 1 if the character at position i of sequence a does not equal the character at position j of the sequence b. So if a and b are 'Jon' and 'Johnny', and position i and j are equal to 3 (or 2 in zero-based arrays), then we take the third letter of each name and compare them. In this instance, the third letters are 'n' and 'h' which are not equal, so we would add 1. You might then ask, what do we add if the two characters are equal? We don't add anything, so it's 0.

Head spinning? You might be happy to know that that is where the glance at notation ends and you can throw questions my way if you would like to know more.

One final thing... I may have shown you how to read elements of mathematical notation, but I never showed you a way of turning it into something more that a bunch of symbols, something that can be used. It may not surprise you that the one way I would show you how it can be used is through computer programming; it's what I do. If you don't understand computer programming I would still encourage you to have a look at this computer code I wrote using the Python programming language and see if you can spot the similarities between the code and the original algorithm.

def LevDist(s, s_len, t, t_len):
 if min(s_len, t_len) == 0:
  return max(s_len, t_len)
 else:
  cost = 0
  if s[s_len-1] != t[t_len-1]:
   cost = 1
  return min(LevDist(s, s_len-1, t, t_len)+1,
                       min(LevDist(s, s_len, t, t_len-1)+1, 
                       LevDist(s, s_len-1, t, t_len-1)+cost))

For the programmers among us, I know this is a very simple and poorly performing solution. The reason it has such poor performance is because of the recursion. The time it takes to determine the distance between strings increases exponentially in relation to the length of the two strings, resulting in poor performance. I won't be discussing better implementations in this post, but if you want to have a look at some potential solutions in one of many programming languages then check out this link to rosettacode.org website.

I also found this resource from stanford.edu regarding edit distances/Levenshtein algorithm which is worth a read: Edit Distance Notes