SQL Server Collation

Question: I was putting together a small bit of SQL code the other day, and I was running into a bug. While debugging it, I was completely shocked to discover:

IF 'a' < 'A'
SELECT 'a < A'

IF 'a' = 'A'
SELECT 'a = A'

IF 'a' > 'A'
SELECT 'a > A'

IF 'a' <> 'A'
SELECT 'a <> A'

displays only 'a = A'.

I investigated and found out that the reason for this result is that the database's collation sequence is set to one of the "CI" values (Case Insensitive). I took a look at a couple of other databases, and they were also "CI". Oddly enough, they were slightly different collation sequence values, but definitely "CI". This includes the [master] database.

Yes, I know it could have been configured the other way. That's not the point of this note. Is "CI" the expected standard configuration?

In addition to the crazy logic implications, I've got to believe that it's at least a bit slower to execute, since all text-based columns needs to be compared in a case-insensitive way. That character set translation can't come for free.


Answer:
The collation setting is specified during the installation of SQL Server. As seen in Figure 1, the default settings used during the installation of SQL Server 2012 are:

Database Engine: SQL_Latin1_General_CP1_CI_AS
Analysis Services: Latin1_General_CI_AS


Figure 1: SQL Server 2012 Default Collation Settings

Technically, there is no "expected CI standard". It all depends on the collation setting of the database. This is not to say you're stuck with that setting forever. If you visit http://msdn.microsoft.com/en-us/library/ms143726.aspx, you will see the following excerpt:

Setting collations are supported at the following levels of an instance of SQL Server:
Server-level collations
The default server collation is set during SQL Server setup, and also becomes the default collation of the system databases and all user databases. Note that Unicode-only collations cannot be selected during SQL Server setup because they are not supported as server-level collations. After a collation has been assigned to the server, you cannot change the collation except by exporting all database objects and data, rebuilding the master database, and importing all database objects and data. Instead of changing the default collation of an instance of SQL Server, you can specify the desired collation at the time that you create a new database or database column.

Database-level collations
When a database is created or modified, you can use the COLLATE clause of the CREATE DATABASE or ALTER DATABASE statement to specify the default database collation. If no collation is specified, the database is assigned the server collation.
You cannot change the collation of system databases except by changing the collation for the server. The database collation is used for all metadata in the database, and is the default for all string columns, temporary objects, variable names, and any other strings used in the database. When you change the collation of a user database, know that there is the potential for collation conflicts when queries in the database access temporary tables. Temporary tables are always stored in the tempdb system database, which will use the collation for the instance. Queries that compare character data between the user database and tempdb may fail if the collations cause a conflict in evaluating the character data. You can resolve this by specifying the COLLATE clause in the query. For more information, see COLLATE (Transact-SQL).

Column-level collations
When you create or alter a table, you can specify collations for each character-string column by using the COLLATE clause. If no collation is specified, the column is assigned the default collation of the database.

Expression-level collations
Expression-level collations are set when a statement is run, and they affect the way a result set is returned. This enables ORDER BY sort results to be locale-specific. Use a COLLATE clause such as the following to implement expression-level collations:

SELECT name FROM customer ORDER BY name COLLATE Latin1_General_CS_AI;

For more information on changing collation, please visit http://msdn.microsoft.com/en-us/library/ms179254.aspx.

Comments

Mark said…
1) The default collation chosen is dependent on the Windows system locale (see http://msdn.microsoft.com/en-us/library/ms143508(SQL.105).aspx).

2) Even given that, virtually ALL default collations are CI; just a tradition on SQL databases. I would prefer it, as I want SELECT * FROM tbl where FirstName = 'sam' to catch both 'sam' and 'Sam'.

3) However, from that BOL article above, notice that only US is setup with a SQL_* collation. MS says its for backwards compatibility. But there is a difference when sorting or matching on unicode strings. As Erland Sommarskog points out on this discussion http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/bfdc32d3-3d36-4d63-8d87-6ee972fd8130/

"A statment like:

SELECT * FROM tbl WHERE col LIKE '%abc%'
will perform about seven times faster with an SQL collation than with a
Windows collation.

But there is also a devilish trap with SQL collation, which is a lot easier to walk into that you might believe. Consider:

SELECT * FROM tbl WHERE indexedvarcharcol = @nvarcharval
Since varchar has lower precedence than nvarchar, you will get an implicit conversion. With a Windows collations, varchar is just a subset of nvarchar, so the index is still usable, althouught can expect an overhead of a factor 2 or 3.

But with an SQL collations there are differences in the rules, so the index cannot be seeked and the cost can be a factor of 100 or 1000 or more depending on the size of the table."

My guess, MS leaves SQL Server as this collation partly for backwards compatibility, and partly for performance with (predominantly US) testers, evaluators, and reveiwers. If using SQL collation is 7x faster, simple tests made with varchar columns (also default) will turn out better for SQL Server than otherwise. It's up to us to decide and use the best collation for our servers/databases.

For me, my preference is the default. Just because most all servers/databases/columns are setup that way, and trying to do queries across different collations is a nightmare (had that unfortunate problem at one client). It's just easier not thinking about that part, and remembering to always make sure my varchar-nvarchar comparisons are explicit and well-written to not have a performance issue.

Just my 2¢
Mark said…
1) The default collation chosen is dependent on the Windows system locale (see http://msdn.microsoft.com/en-us/library/ms143508(SQL.105).aspx).

2) Even given that, virtually ALL default collations are CI; just a tradition on SQL databases. I would prefer it, as I want SELECT * FROM tbl where FirstName = 'sam' to catch both 'sam' and 'Sam'.

3) However, from that BOL article above, notice that only US is setup with a SQL_* collation. MS says its for backwards compatibility. But there is a difference when sorting or matching on unicode strings. As Erland Sommarskog points out on this discussion http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/bfdc32d3-3d36-4d63-8d87-6ee972fd8130/

"A statment like:

SELECT * FROM tbl WHERE col LIKE '%abc%'
will perform about seven times faster with an SQL collation than with a
Windows collation.

But there is also a devilish trap with SQL collation, which is a lot easier to walk into that you might believe. Consider:

SELECT * FROM tbl WHERE indexedvarcharcol = @nvarcharval
Since varchar has lower precedence than nvarchar, you will get an implicit conversion. With a Windows collations, varchar is just a subset of nvarchar, so the index is still usable, althouught can expect an overhead of a factor 2 or 3.

But with an SQL collations there are differences in the rules, so the index cannot be seeked and the cost can be a factor of 100 or 1000 or more depending on the size of the table."

My guess, MS leaves SQL Server as this collation partly for backwards compatibility, and partly for performance with (predominantly US) testers, evaluators, and reveiwers. If using SQL collation is 7x faster, simple tests made with varchar columns (also default) will turn out better for SQL Server than otherwise. It's up to us to decide and use the best collation for our servers/databases.

For me, my preference is the default. Just because most all servers/databases/columns are setup that way, and trying to do queries across different collations is a nightmare (had that unfortunate problem at one client). It's just easier not thinking about that part, and remembering to always make sure my varchar-nvarchar comparisons are explicit and well-written to not have a performance issue.

Just my 2¢

Popular posts from this blog

Working with double quotes (") in VB.Net

Using VB Script to display part of an IP address

Object reference not set to an instance of an object when using HTMLEditorExtender