Tuesday, June 28, 2016

Loading Data Using Netezza Utilities

Let's look at some of the utilities available in Netezza.

nzload:

nzload is a command-line utility that can be run on the local host or on a remote client. The nzload command is used for bulk loading of data into tables. It works much like an external table, but it is faster and multiple files can be loaded into tables in parallel. To load a file, first create the target table and then use the nzload command to load the data into it.

Syntax for nzload :

nzload -db <database name> -host <host name> -u <username> -pw <password>
-t <tablename> -df <datafilename> -delim <delimiter> -dateDelim <date delimiter in single quotes> -dateStyle <date format> -bf <bad filename> -lf <log filename>
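For instance, a pipe-delimited file could be loaded with an invocation like the one below (the database, host, table, and file names are placeholders for illustration):

```
nzload -db salesdb -host nzhost -u admin -pw password \
    -t customers -df /tmp/customers.dat -delim '|' \
    -bf customers.bad -lf customers.log
```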


The nzload command returns one of the following exit statuses:

0 --> the data was loaded successfully into the target table
1 --> the data was loaded into the table, but with errors, and the error count is within the -maxErrors limit
2 --> nzload failed
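A wrapper script can branch on this exit status. A minimal sketch (the nzload arguments are placeholders, and the messages are illustrative):

```
#!/bin/sh
# Run the load and capture the exit status
nzload -db salesdb -t customers -df /tmp/customers.dat -delim '|'
rc=$?
case "$rc" in
    0) echo "load succeeded" ;;
    1) echo "load completed with errors (within -maxErrors)" ;;
    2) echo "load failed" >&2; exit 1 ;;
esac
```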

Other options available are:

maxErrors --> sets the limit on the number of allowable errors
skipRows --> skips header records at the start of the file
fillRecord --> treats missing trailing fields in a record as nulls
ignoreZero --> ignores zero (binary 0x00) bytes in the input
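Putting some of these options together, a hypothetical load that skips a one-line header and tolerates up to 10 bad rows might look like:

```
nzload -db salesdb -host nzhost -u admin -pw password \
    -t customers -df /tmp/customers.dat -delim '|' \
    -skipRows 1 -maxErrors 10 -fillRecord -ignoreZero
```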

Bad file and log file:

The nzload command generates a log file and a bad file, named in the format <tablename>.<database name>.log and <tablename>.<database name>.bad. By default these files are created in the current working directory; we can override the default location with the -outputDir parameter, and override the file names themselves with the -lf and -bf parameters. Internally, nzload creates an external table, loads the data through it, and then drops the table after the load. Statistics are automatically generated for these tables upon completion.

External tables:

External tables can also be used for loading data into a table, and statistics are automatically generated for an external table. We can drop an external table or insert records into it; however, we cannot delete records from it or truncate it, nor can we use it in a UNION operation or in multi-table joins. An external table is a logical concept: the data actually resides in the file. When we insert records into an external table, the data is added to the file associated with the table, and when we select from it, the data is read from that file.

Syntax for external table :

CREATE EXTERNAL TABLE <table_name>
[SAMEAS <table>] | (<column_name> [, ...])
USING (DATAOBJECT ('<host_file_name>') DELIMITER '<field_delimiter>') [options]
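As a sketch, an external table over a pipe-delimited file (the file path, table, and column names are placeholders) could be created and used for both loading and unloading:

```
-- Define an external table whose data lives in a flat file
CREATE EXTERNAL TABLE ext_customers (
    cust_id   INTEGER,
    cust_name VARCHAR(100)
)
USING (DATAOBJECT ('/tmp/customers.dat') DELIMITER '|');

-- Load: selecting from the external table reads rows from the file
INSERT INTO customers SELECT * FROM ext_customers;

-- Unload: inserting into the external table writes rows to the file
INSERT INTO ext_customers SELECT * FROM customers;
```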
 
The options are the same as those of the unload command. The _v_external and _v_extobject system views can be used to get the metadata about external tables.


Execution Plan in Netezza

Cost-Based Optimizer:

Like other databases, Netezza generates an execution plan. Netezza has a cost-based optimizer, which generates candidate plans for a query and then executes the query with the least-cost plan.

The optimizer generates the plan based on a number of statistics, some of which are described below:

1. Number of rows in each table
2. Minimum and maximum value of each column involved in the query
3. Number of extents in each table, and the number of extents in the data slice with the largest skew
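When these statistics are stale, they can be refreshed manually with the GENERATE STATISTICS command (the table and column names below are placeholders):

```
-- Refresh optimizer statistics for one table
GENERATE STATISTICS ON customers;

-- Or restrict the refresh to specific columns
GENERATE STATISTICS ON customers (cust_id);
```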

Plan generation:

When a query is submitted, Netezza automatically generates the plan and stores it under the nz/data/plan directory on the host, while the generated C code is stored under the nz/data/cache directory. The cache directory keeps the code for recent queries so that new queries can be compared against it, preventing recompilation of the same query.

We can also view the explain plan for a query, to check whether everything is in line with expectations, using the commands below:

explain verbose <sql query>;
explain plantext <sql query>;
explain plangraph <sql query>;
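For example, the verbose plan for a simple join could be inspected like this (the table and column names are placeholders):

```
EXPLAIN VERBOSE
SELECT c.cust_name, o.order_amt
FROM customers c
JOIN orders o ON c.cust_id = o.cust_id;
```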

There are several points to check in the explain plan:

  • Estimated cost, which shows the cost of the query. If the cost is very high, we might have to revisit the code to ensure that it is correct.
  • Estimated rows, which shows the expected number of rows. If the estimated rows are lower than expected, we might have to generate statistics for the objects involved.
  • Percentage estimation of the cost.
  • Restrictions - the where clause used in the query.
  • Projections - the columns used in the query.
  • SPU Distribute on column - this means that the table is redistributed on that column. We need to check whether the proper column is used for redistribution.
  • SPU Broadcast - this means that a copy of the entire table is made available on all the SPUs. This is suitable when a large fact table is joined with small dimension tables: broadcasting the small tables avoids redistributing the large one.
  • The optimizer performs hash, merge, and nested-loop joins, of which hash joins are the most efficient. We can check the plan for the type of join used. If a floating-point column is joined with an integer column, a hash join may not be used even where one is expected.
Main causes of performance issues in NZSQL:

1. Table skew - an incorrect distribution key, resulting in more data being stored on one disk than on the others.
2. Process skew - even if the table is distributed evenly, fetching data from only one slice can result in process skew.
3. Zone maps not being assigned properly - zone maps are defined only for integer columns, so varchar or varchar2 columns will not have zone maps.
4. Proper keys not being used, resulting in redistribution or broadcast of big fact tables.
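Table skew, in particular, can be checked directly: each row carries a datasliceid value, so grouping on it shows how evenly the rows are spread across data slices. A sketch (the table name is a placeholder):

```
-- Rows per data slice; large differences between counts indicate skew
SELECT datasliceid, COUNT(*) AS row_count
FROM customers
GROUP BY datasliceid
ORDER BY row_count DESC;
```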