Tuesday 3 June 2014

Databases - more than a big box of stuff

First up, my apologies for the lack of blog posts. I've been hard at work on some new projects over the past few months that have taken up my time and which I'll be blogging about soon enough.

So, without further ado, lets begin.

"Database: a comprehensive collection of related data organized for convenient access [...]"
Source: Dictionary.com
Simply put, databases and data warehouses are systems for storing information which we have no doubt worked with or have heard of at some point in our lives. In fact, you may have used a database without even realising it. Dictionary.com defines the word database as: 'a comprehensive collection of related data organised for convenient access [...]'. Can you think of some examples that fit that definition?

Here are a few examples I came up with, most of which fit the definition loosely but can ultimately be defined as a database nonetheless:
  • Rolodex
  • Telephone book
  • Filing Cabinet
  • Mobile Phone Contacts List
  • Computer Hard-drives
  • Computer Software/Systems (MySQL, SQL Server, Oracle, sqlite, Access, Excel, etcetera)
These examples allow for convenient storage and access to information whether its by collections of alphabetically indexed cards and numbered pages or physical disk platters and digital tables. The key points to bear in mind regarding databases are that the data is organized and organized in such a way to allow for easy access.

The reason for this brief explanation is quite simply to show you that databases are not simply boxes of stuff you can just throw things into and pick out of at random; they don't look like my attic, disorganized and awkward to access.

That might seem obvious to most people, but it is a lesson that I learnt the harder way, partly because of my background in applications programming where data is manipulated in a different way to databases. My first attempts at database development and writing SQL showed this.

If you work with or intend to work with databases (digital or otherwise) and use them as I did, now is the time to consider changing that. So how can you do that?

Understand the basic concepts and principles of X


I cannot stress this enough as I have learnt this the hard way in some instances, that, without understanding the concepts and principles of whatever X is, we are likely going to be working against ourselves, relying on guesswork, trial and error and our limited or non-existent experience. It's like learning another language where some might learn stock words and phrases making their vocabulary very limited, while others learn how sentences are structured, how and why words mutate and how to correctly pronounce words so they can communicate with a certain degree of fluency.

This is true of databases, where, we might know about tables, views, functions and indexes and how to write SQL but we don't fully understand how or why something works. Obviously this limits us significantly in the effective use of databases in the same way a person learns a language by only remembering stock words and phrases.

So let's look at some basic database concepts, starting with one that I feel is not covered adequately enough (if at all) in Computer Science database modules and that is 'set theory'. Just to mention, the reason I feel it is not covered adequately enough is that I have worked with some CS graduates who took database modules but had very limited understanding of set theory (which I appreciate could be the student as much as it could be the teacher/module content) or weren't aware of it.

Set theory is a branch of mathematics that works with collections of objects which is precisely what a database is: a collection of information. So when we work with databases, understand that we are working with sets or collections of information, not simply individual records. Using the Telephone book as an example, when we turn to a page, we are looking at a collection of hundreds of phone numbers which we then need to filter to find the results we want, whether it is a single phone number or a subset of them. Sure, we might filter them by going line-by-line, but a database does not. That said, it is possible to go row-by-row in a database but it is not efficient or convenient and goes against the fundamental concepts of databases.

So that's the first concept, the second is that of 'relationships'. Going back to the dictionary definition, it uses the phrase 'related data', so what does that mean? A page in a Telephone book contains not just a list of telephone numbers but the person associated with that number. In other words, the person is related to that particular number and vice versa. In a database this is usually represented by tables and keys, so we might have a table with peoples names in which links to another table containing all the phone numbers using keys. Voila, we have related data.

Perhaps I have over-simplified it, but it'll suffice. Again, this concept of relational data has roots in mathematics, namely, relational algebra and tends to be formally referred to as the 'entity-relationship model' or the 'relational model'.

Third concept is 'indexing'. It is worth mentioning at this point that SQL (structured query language) is made up of separate languages, namely DCL (data control language), DDL (data definition language) and DML (data manipulation language). I won't be covering them in detail, but indexing falls under DDL and DML.

DDL is used for defining or describing data, part of which involves the creation of indexes which is as much as I will be saying about DDL for now. DML is used for manipulating the data and uses indexes to do so. Remember that a database allows for easy access which indexes are essential for. Going back to the Telephone book, imagine if we removed the alphabetical ordering, page numbers and categorization and then threw all the pages into a bag. Now you need to find the phone numbers for all J. Does. Not so easy now is it?

Put simply, indexing is a way of maintaining organized lists that reference locations of information. Like the Telephone book, most textbooks have an index at the front or back with titles or keywords linked to page numbers so we can find information easily. It's outside the scope of this post, but it is worth mentioning that databases actually use the concept of pages and indexes in a similar way to a textbook. If you'd like to find out more and can handle some technical jargon, then have a look at the SQL Server Planning and Architecture Articles over at Microsoft TechNet.

As a final note, not all databases are relational. If you have a few minutes to spare and are curious, check out non-relational databases which are sometimes referred to as NoSQL databases.

It goes without saying that I have barely scratched the surface of relational databases. I'll leave it there for now and save the rest for another day so stay tuned.