Sunday, September 25, 2016

Common Table Expressions in SQL Server

This is an interesting topic that I came across while working with Microsoft SQL server and it is the concept of common table expressions.


Common Table Expressions generally referred to as CTE is a key concept in SQL Server. It is a temporary named resultset that can be referenced within a DML operation like Insert/Delete/Update/Merge. Suppose we have to do a join of table and then join the above resultset with another table. There are 3 ways of doing it .One is to create a table with join of 2 tables and then in another sql query join the above created table with that of the second table. It is a feasible solution .But creating lot of standard tables increases the complexity and result in confusion. Also if the intermediate result is of no use later then there is no point in doing so. The second is creating a complex SQL query with many subquerys which will do the functionality. Yes, it is also a viable solution.But from a third persons perspective this may sound weird since he may not be able to understand the functionality easily.Hence we finally arrive at the third solution which is the CTE


There are 2 types of CTE supported by SQL server and they are recursive CTE s and non recursive CTE s.Lets first look at non recursive CTE s


Non Recursive CTE s :

           Non Recursive CTE s are temporary resultset that reference other tables and not itself. Lets look at the syntax for this
 
with t1(c1,c2)
as
    (
       select m1,sum(m2)
       from t 2
       group by m1
    )
      select c1,a1,sum(c2)
      from t3 join t1
      on t3.a1=t1.c1
      group by c1,a1;


This may sound confusing. Let us look into the query in detail. I have created a CTE which is table t1 with columns c1 and c2. this means that the query by which the CTE is created should return 2 columns of the same datatype . I am using the columns m1 and their aggregation which is the sum from another table into the CTE which is the t1 .so T1 has 2 columns which is the m1 column of t2 and the aggregated value of m2 .Now after creating the CTE am joining the CTE table that is t1 with another table t3 and pulling the values form both the tables. Now this would have become more clear


Lets look at it further with the help of employees, departments and locations tables, If I want to find the location of an employee and their sum of salaries in each location based on each department I need to join the employees and departments table and then join it with the locations table. Lets use a CTE for that now


with temp_dep (emp_id,dept_id,emp_name,sal)
as
   (
     select emp_id, dept_id,emp_name,sum(salary) as sal
     from employees
    group by emp_id,emp_name,dept_id,
  )
     select location,emp_name,sal
     from locations l join temp_dep d
     on l.dept_id=d.dept_id

This would give me the location and their sum(salary).This made the understanding of the query more easy since we have two definitions and by looking closer we know what does each query intended to do.


Recursive CTE :


                   Recursive CTE s are those that reference itself .Like finding a hierarchy we might create a GTE and refer that in the table creation itself. We might have learned self join where in we join employees with the same table employees and use first tables employee id = second tables manager id condition. The same logic applies here .Lets see an example


WITH
  cteReports (EmpID, FirstName, LastName, MgrID, EmpLevel)
  AS
  (
    SELECT EmployeeID, FirstName, LastName, ManagerID, 1
    FROM Employees
    WHERE ManagerID IS NULL
    UNION ALL
    SELECT e.EmployeeID, e.FirstName, e.LastName, e.ManagerID,
      r.EmpLevel + 1
    FROM Employees e
      INNER JOIN cteReports r
        ON e.ManagerID = r.EmpID
  )
SELECT
  FirstName + ' ' + LastName AS FullName,
  EmpLevel,
  (SELECT FirstName + ' ' + LastName FROM Employees
    WHERE EmployeeID = cteReports.MgrID) AS Manager
FROM cteReports
ORDER BY EmpLevel, MgrID


Here the first section mentioned in blue is the anchor member and the second section highlighted in violet is the recursive section. The recursive section as we can see refer the same CTE table that we created and that section can only refer a CTE .Both the Anchor member and the recursive members are joined by Union/Union all/Intersect or by any set operators. All the Anchor members must precede the recursive members and both must return the same number of columns with matching datatype. This may sound little complex,but when it comes TSQL it is very useful for doing any operations on the tables like select,insert,update,delete and so on.           
              







Saturday, September 10, 2016

Stored procedure in Netezza

NZPLSQL

There are several blogs for creating anonymous or named blocks of code in all the other databases be it Oracle or MS SQL,but I could see a very few post about Netezza. This could be primarily because Netezza is not being used by many companies and also the usage of named or Anonymous blocks might not come up in Netezza very frequently (since it is primarily used for data mining on large data volumes which can simply be done using a single NZSQL query). Based on my knowledge and work experience with Netezza let me jot down what comes to mind .

Like all the other DB s Netezza also has Anonymous blocks,Stored Procedures,Collection,Functions and so on.Lets examine one by one and also try to differentiate it with other DB's as well.

Anonymous blocks :
     
               There would be declare section and a begin and end section in the Anonymous block and an optional exception session as well.Any block of code that lies in between the begin and end would be executed.The main purpose of Anonymous block could be to execute other named blocks or to execute some blocks for testing purpose which need not be stored as an object in the Database.

DECLARE

       Aid Integer;
       Abc Boolean;
       Pname character varying(100);

BEGIN

          Pname := 'Harvard Thomas';
     
          RAISE NOTICE ' The name of the person is %',Pname;

END;

Upon executing the block the name would be printed with the notice statement .This works perfectly for testing scenarios or for creating another named block and so on.lets examine that at a Later point of time.

Stored Procedure :

stored procedure is one of the most important named block in any Database Query language.The main purpose of using stored procedure is that it gets compiled and stored in the DB as an object and can be reused at a later point of time.If our logic requires us to use loops,branches or any other statements that cannot be executed in one single query ,we might have to do it through a stored procedure.

The syntax of Stored procedure is slightly different in Netezza and lets examine that

CREATE OR REPLACE PROCEDURE my_testing() returns varchar
LANGUAGE NZPLSQL AS
BEGIN_PROC
    DECLARE
          string varchar;
    BEGIN
          string :='This string is wrong';
          RETURN string;
    END;
END_PROC;


I have highlighted the keywords in caps .The create or replace procedure followed by the procedure name (here it is my_testing) and the parenthesis says the procedure does not accept any parameters and then returns varchar means that the procedure will return a character value.The keyword language nzplsql is unique to Netezza which says the programming language is nzplsql and the begin_proc and the end_proc are the opening and the ending blocks in nzplsql and inside this there is an anonymous block with declare begin and end.Here we are declaring a string which is a character and in the body assigning a value 'this string is wrong' to that string.

This is a pretty simple procedure that accepts no parameter and returns the string.

Executing the procedure:

we can execute the procedure my_testing using one of the following commands

CALL  my_testing();
EXEC my_testing();
EXECUTE my_testing();
EXECUTE PROCEDURE my_testing();
select my_testing();

One thing to note here is that the select cannot have any other clause like from or where .

Viewing the procedure :

Privileged users can use the command show procedure for obtaining the code for procedures.It accepts the following arguments.

SHOW PROCEDURE VERBOSE;
SHOW PROCEDURE ALL;
SHOW PROCEDURE PROCEDURE_NAME;

  command to view the contents of the procedure .We can also view the procedure using NZADMIN  tool to view the procedure .

Altering the procedure :

We can use the below command and  do the modification to the procedure body if needed.

ALTER PROCEDURE <procedure_name> 

Dropping the procedure :

We can either hide or drop a procedure .For dropping the procedure, use
DROP PROCEDURE  <procedure name >

Now that we have covered the basics of how to Create,Alter,Drop or View a procedure,we can go a little deep into the additional features available with Netezza for stored procedures.


Security Aspects while creating a procedure :

One of the key concepts of creating a named block is the security concerns it possess.A procedure may access several DB objects inside the body like Tables,views etc.If the table data needs to be prevented from the user then the stored procedure should be designed such a way that the data in the table is not accessible to all,but only to the person who created the table.For this there is a feature in Netezza which is EXECUTE AS CALLER 

Lets see the syntax for that 

CREATE OR REPLACE PROCEDURE my_testing() returns varchar
LANGUAGE NZPLSQL
EXECUTE AS (CALLER | OWNER)
BEGIN_PROC
     DECLARE 
     BEGIN
     END;
END_PROC;

Here the EXECUTE AS CALLER means that the procedure will be executed with the permissions available for the caller and hence the underlying data in the table will be accessible to the user only if he has permissions on the table .

The default option is EXECUTE AS OWNER is the default option while creating the procedure and this means that the procedure will execute with the owners permissions and whoever runs the procedure can view the data in the underlying table since it has the owners permissions.

Hence if we are concerned about the security aspects then we should go for the first option ie execute as caller while designing the procedure.There is another important aspect in security ie what permissions a user should be given.So If you happen to be an Admin then this is one of your major concerns since this issue can caue a huge havoc.Permissions can be granted to a procedure in the following way

GRANT CREATE PROCEDURE  TO abc;

This implies that now abc can create a procedure .

GRANT CREATE PROCEDURE TO group dev;

This means that all the users in the group dev can now create a procedure 

GRANT EXECUTE ON my_testing() to abc;

This means that now abc can create any procedure but execute only my_testing()

we can also revoke the permissions on a procedure using the revoke command 

REVOKE CREATE PROCEDURE from abc;
REVOKE EXECUTE ON my_testing() from ABC;

Auto commit in NZPLSQL procedures:

we have two options available in Netezza  ie auto commit off/on to specify whether we need to execute the statements one by one or by a single block,The default mode is auto commit off in which the entire procedure will run as a single command and if we need commit/rollback on particular statement we can enable the auto commit on statement for that .

Altering the above procedure to include auto commit we can write the code as \

CREATE OR REPLACE PROCEDURE my_testing() returns varchar
LANGUAGE NZPLSQL
EXECUTE AS (CALLER | OWNER)
BEGIN_PROC
     DECLARE AUTOCOMMIT (ON| OFF)
     BEGIN
     END;
END_PROC;

lets look at the difference between auto commit on and off in more detail.In normal plsql code when we execute the procedure it executes as a single block 

eg 

BEGIN
       statement1;
       commit;
       statement2;
      rollback;
      statement3;
END;

Assume the above code is run in auto commit off mode.in this case the statement 1 will execute and then commit the transaction.After that statement 2 will execute and rollback and statement3 will execute and it will not be committed/rollbacked. when I commit the procedure after executing then the third statement will be committed.In case if an error occurred in the first statement then the procedure will not execute.If the error is on the second statement then first statement will be committed and the second statement will not execute and the control will come back to the main.If the error is say on the third statement then the first statement will be committed and the second will be rollbacked and the control will come to the main.Even if we give a commit after calling the procedure the second statement will be rollbacked.

Now let us execute the statement using auto commit on and the block need not require commit or rollback statements here .

BEGIN
      statement 1;
      statement 2;
      rollback;
     statement 3;
END;

Here the statement 1 will be executed and then it will be automatically committed since we have enabled the auto commit on and then the statement 2 will execute and that also will be committed irrespective of the roll back command issued and then statement 3 also will be committed.If any of the statements fail the control will come back to the main with the previous transactions committed. Roll backing the commands after executing will not work here.

So the question is when should we go for Auto commit on/off.It is always safe to use auto commit off since the major transactions can be rechecked before committing them.But in case of any auditing purpose we need to track all the operations on a table we can go for auto commit on.




















Tuesday, June 28, 2016

Loading Datas using Netezza Utilities

Lets look at some of the Utilities available in Netezza 

Nzload :

nzload is a command line utility which can be run in the local or remote client.Nzload commad is used for bulk loading of data into tables .It is like external table but this works faster and multiple files can be parelley loaded into tables.For loading a file first create the table and then use the nzload command to load into table

Syntax for nzload :

nzload -db <database name> -h <host name> -u <username> -pw <password>
-t <tablename> -df <datafilename> -delim <delmiter>  -dateDelim <date delimiter in single quotes> -dateStyle <date format> -bf <bad filename> -lf <log filename>


nzload command returns the following status :

0--> the data is loaded successfully into the target table
1--> the data is loaded into the table but with errors and the errors are within the maxErrors option
2-->nzload failed

Other options available are 

maxErrors-->we can set the limit for errors
Skiprows --> for skipping header records
Fillrecord-->fills the trailing spaces to nulls
ignoreZero-->ignores zero byte record 

Badfile and Logfile :

nzload command generates log and bad files which will be in the format <tablename>.<database name>.bad/log.the files by default will be present in the working directory. we can override the defualt location by specifying the location in outputDir parameter .We can override the log and bad file usinf -lf and -bf parameters.Nzload internally creates an external table ,loads the data into the table and then drop the table after loading .Statistics are automatically generated for these tables upon completion

External tables :

External tables can also be used for loading data into the table 
statistics are automatically generated for an external table.we can drop or insert records into an external table.However we cannot delete or truncate an external table.Also we cant use it in union operation or for multiple table joins.When we create an external table ,it is a logical concept and the data actually resides in the file.When we select or insert records into external table data gets added in the file associated with the table.

Syntax for external table :

CREATE EXTERNAL TABLE <table_name>
[sameas table] | ([column_name [,...]])
USING (DATAOBJECT (<host_file_name>) DELIMITER <field_delimiter>) [options]
 
The options are same as that of unload command._v_external and _v_extobject can be used to get the  
 meta data about the external tables.

NZSQL:






Execution Plan in Netezza

Cost Based Optimizer :

Netezza also generates an execution plan like other databases.Netezza has a cost based optimizer which creates a plan based on the query and then exeutes the query with the least cost one.

Optimizer generates the plan based on number of statistics some of which are described below

1.Number of rows in a table
2.Minimum and Maximum value of each column involved in the query
3.Number of extends in each table and the total number of extends in each data slice with the largest skew

Plan generation :

When a query is generated it automatically generates the plan and store it under the directory nz/data/plan directory in the host and the c code will be under the directory nz/data/cache .Cache directory contains the code for the  recent query to cmpare them with the new query to prevent re compilation of the same query

We can also view the explain plan for the query to check whether everything is inline using the below commands

explain verbose <sql query>;
explain plantext <sql query>;
explain plangraph <sql query>;

There are several points to check in the explain plan :

  • Estimated cost which shows the cost of the query .If the cost is very high we might have to revisit the code to ensure that it is correct.
  • Estimated rows which shows the number of rows.If the estimated rows is less than expected it means we might have to generate the statistics for the objects
  • Percentage estimation of the cost.
  • Restrictions - which shows the where clause used in the query
  • Projections -  The columns used in the query
  • SPU Distribute on column - this means that the table is redistributed based on the column .We need to check whether proper column is used for re distribution.
  • SPU Broadcast - This means that the copy of the entire table is made available in all the snippets.This is more suitable for a large fact table joined with small dim tables.
  • Optimizers performs hash,merge and nested loop joins of which hash joints are more efficient .We can check in the plan for the type of joins used .If a floating point integer is joined with an integer,then hash join may not be implemented where as it is expected.
Main Cause of performance issues in NZSQL:

1.Table skew- In correct distribution of the key resulting in more data being stored in one disc when compared to the others.
2.Process skew- even if the table is distributed evenly,fetchign data from only one slice may result in process skew
3.Zone maps not being assigned properly .Only for Integer columns zone map will be defined.so distributing on varchar or varchar2 will not create zone maps
4.Proper keys not beign used resulting in re distribution or broadcast  of big fact tables