Tuesday, July 1, 2014

Netezza Basics


Important things to note in the Netezza architecture:

Host:

The host is a high-performance Linux server set up in active-passive mode. It is responsible for compiling SQL queries into executable blocks called snippets, creating an optimized query plan, and distributing the snippets to the nodes for execution.

Snippet blades or S-blades:

S-blades (snippet blades) are independent servers, each containing multi-core CPUs, a multi-engine FPGA, and its own memory, all designed to work concurrently to deliver peak performance. Each S-blade is connected to a set of 8 disks.

Field Programmable Gate Array (FPGA):

The FPGA is a key component of the Netezza architecture. It has the following engines embedded in it:

Compress engine - Uncompresses the data at wire speed, transforming each block on disk into 4 to 8 blocks in memory.

Project and Restrict engine - Further increases performance by filtering out rows and columns based on the SELECT list and the WHERE clause.

Visibility engine - Filters out rows that should not be seen by the query, i.e. data that has not been committed.

Disk enclosures:

The disks are high-density, high-performance drives. Each table's data is uniformly distributed across the disks. A high-speed network connects the disks to the S-blades so that data streams at the maximum rate possible.

Optimizer:

The host compiles the query and generates an execution plan. The optimizer's intelligence is a key factor in query performance. The optimizer makes use of all the nodes to gather up-to-date statistics on every database object referenced in a query. Another example of optimizer efficiency is in calculating the join order: if, for example, a small table is joined against a large fact table, the optimizer can broadcast the small table to all the S-blades while keeping the large fact table distributed across the snippets. This approach minimizes data movement while taking advantage of parallel processing. The optimizer thus minimizes I/O and data movement, the two factors that slow performance in a warehouse system. Other functions of the optimizer include:

Determining the correct join order
Rewriting expressions
Removing redundancy in the SQL operations
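To see the decisions described above, you can ask Netezza to show the plan it generates. A minimal sketch, assuming hypothetical employees and dept tables:

```sql
-- EXPLAIN VERBOSE prints the optimizer's plan for a query,
-- including the chosen join order and whether tables are
-- broadcast or redistributed across the S-blades.
EXPLAIN VERBOSE
SELECT d.dept_name, COUNT(*)
FROM   employees e
JOIN   dept d ON e.dept_id = d.dept_id
GROUP  BY d.dept_name;
```

Reading the plan output is the quickest way to confirm whether a small table is being broadcast as described above.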

Compiler:

The compiler converts the query plan into executable segments known as snippets, which are executed in parallel. An intelligent feature of the compiler is the object cache, a large cache of previously compiled snippet code with parameter variations; this eliminates compilation for many snippets.

When an SQL query is executed, the following events take place.
  • The optimizer generates an execution plan and the compiler creates scheduled tasks called snippets.
  • The data is moved from the disks to the corresponding S-blades in compressed form over a high-speed network.
  • The snippet processor reads the table data into memory using a technique called zone maps, which reduce disk scans by storing the minimum and maximum values for each block, so blocks outside the requested range are never fetched. Unlike indexes, zone maps are created and updated automatically.
  • The compressed data is cached in memory using a smart algorithm that keeps the most frequently accessed data instantly available rather than fetching it from disk.
  • The data then moves to the FPGA (field programmable gate array), which uncompresses the data, applies the projection and filter conditions, and passes the result to the CPU.
  • The CPU performs the remaining operations such as joins, and the results of each snippet are sent to the host, which performs the final calculation and passes the results to the end user.
Here, most of the processing happens close to the disk (in the FPGA) and comparatively little in the CPU.

Each disk holds primary data, mirror data, and free space. So even if one disk fails, the data can be obtained from its mirror on another disk.

When a table is created, pieces of the table are distributed across all the disks; the data is therefore fetched at a faster rate, and parallel processing is achieved in that way.
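The distribution is controlled by the DISTRIBUTE ON clause at table creation time. A minimal sketch, with made-up table and column names:

```sql
-- Hash-distribute rows across the disks on a high-cardinality key.
CREATE TABLE employees
(
    emp_id   INTEGER NOT NULL,
    dept_id  INTEGER,
    emp_name VARCHAR(100)
)
DISTRIBUTE ON (emp_id);

-- Alternatively, spread rows evenly without a key:
-- DISTRIBUTE ON RANDOM;
```

If no DISTRIBUTE ON clause is given, Netezza picks a distribution itself, which may not be the best one for your joins.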

DMLs in Netezza:

In Netezza, DML statements such as INSERT, UPDATE, and DELETE are autocommitted: the change is committed as soon as the query executes. However, once you delete or update a row, the data is not completely lost.

Rollback of a delete:

If we have accidentally deleted a record and need to roll it back, we can set a session option:

set show_deleted_records = true

Now, if we select from the table where the hidden deletexid column is non-zero, we will see the deleted rows along with the id of the transaction that deleted them, and we can re-insert the record into the table:

insert into t select * from t where deletexid = 123;
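Putting the steps above together as one session, with an illustrative table name and transaction id:

```sql
-- Make logically deleted rows visible in this session.
set show_deleted_records = true;

-- Find the deleted row and the transaction id that removed it.
SELECT deletexid, *
FROM   orders
WHERE  deletexid <> 0;

-- Re-insert it (assume deletexid 48219 was returned above).
INSERT INTO orders
SELECT * FROM orders WHERE deletexid = 48219;

-- Restore normal visibility.
set show_deleted_records = false;
```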

Rollback of an update:

When you update a row, Netezza deletes the old record and inserts a new one, so both a delete transaction id (deletexid) and an insert transaction id (the hidden createxid column) are populated. To recover or roll back the update, delete the record inserted by that transaction and re-insert the record that carries the update's deletexid.
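A sketch of that idea, assuming a hypothetical orders table and transaction id:

```sql
-- Assume transaction 48220 performed the unwanted update.
set show_deleted_records = true;

-- Remove the new version written by the update.
DELETE FROM orders WHERE createxid = 48220;

-- Restore the old version that the update logically deleted.
INSERT INTO orders
SELECT * FROM orders WHERE deletexid = 48220;

set show_deleted_records = false;
```

Note the DELETE here is itself a logical delete, so double-check the result before running GROOM.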

In both cases, if GROOM TABLE has been run, the logically deleted row versions are reclaimed and a rollback may no longer be possible.

Also, TRUNCATE TABLE removes the records permanently and hence cannot be rolled back.


Tuning of NZSQL:

1. Always distribute tables on a proper key. An integer column with high cardinality is a good choice for distribution.
2. When joining two tables, use the common column as the distribution key in both tables. For example, if dept_id is the primary key of the dept table and a foreign key in the employees table, use dept_id as the distribution key in both tables. The data type of the key should also be the same in both tables.
3. Even with a proper distribution key, some joins might be long-running. Check whether any single value (e.g. -1 or 'Unknown') accounts for a disproportionate number of records; such skew makes the join take longer. Try generating random surrogate numbers for those rows in a separate table and associating them with the main table.
4. Always try to broadcast the small table to improve performance. You can set enable_factrel_planner = true and then set factrel_size_threshold to a high value like 15000000, which says anything under 15 million rows is more of a dimension, so broadcast or redistribute it, while tables over 15 million rows are big facts, so try to leave them in place and don't redistribute.
5. Regularly run GROOM TABLE and GROOM TABLE ... VERSIONS. Also generate statistics on the objects you create.
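The maintenance commands in point 5 can be sketched as follows (the table name is illustrative):

```sql
-- Reclaim space held by logically deleted rows.
GROOM TABLE sales;

-- Collapse the multiple table versions left behind by ALTER TABLE.
GROOM TABLE sales VERSIONS;

-- Refresh optimizer statistics so query plans stay accurate.
GENERATE STATISTICS ON sales;
```

Running these after large deletes, updates, or schema changes keeps both storage usage and plan quality under control.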