Data Modeling

1) What is a data warehouse?
A data warehouse (DW) is a database used for reporting and analysis. Data is offloaded from the operational systems into the warehouse for reporting.

2) Describe its architecture.
Operational database layer:
The source data for the data warehouse — An organization's Enterprise Resource Planning systems fall into this layer.
Data access layer:
The interface between the operational and informational access layer — Tools to extract, transform, load data into the warehouse fall into this layer.
Metadata layer:
The data dictionary — This is usually more detailed than an operational system data dictionary. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
Informational access layer:
The data accessed for reporting and analysis, and the tools used to report on and analyze that data. This layer is also called the data mart. Business intelligence tools fall into this layer.

3) OLTP v/s OLAP
IT systems can be divided into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

The following table summarizes the major differences between OLTP and OLAP system design.


OLTP System - Online Transaction Processing (the operational system)
OLAP System - Online Analytical Processing (the data warehouse)

Source of data
    OLTP: Operational data; OLTP systems are the original source of the data.
    OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
    OLTP: To control and run fundamental business tasks.
    OLAP: To help with planning, problem solving, and decision support.

What the data reveals
    OLTP: A snapshot of ongoing business processes.
    OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
    OLTP: Short, fast inserts and updates initiated by end users.
    OLAP: Periodic long-running batch jobs refresh the data.

Queries
    OLTP: Relatively standardized, simple queries returning relatively few records.
    OLAP: Often complex queries involving aggregations.

Processing speed
    OLTP: Typically very fast.
    OLAP: Depends on the amount of data involved; batch refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
    OLTP: Can be relatively small if historical data is archived.
    OLAP: Larger, due to aggregation structures and historical data; requires more indexes than OLTP.

Database design
    OLTP: Highly normalized, with many tables.
    OLAP: Typically de-normalized, with fewer tables; uses star and/or snowflake schemas.

Backup and recovery
    OLTP: Backed up religiously; operational data is critical to running the business, and data loss is likely to entail significant monetary loss and legal liability.
    OLAP: Instead of regular backups, some environments may simply reload the OLTP data as a recovery method.

source: www.rainmakerworks.com
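
To make the query contrast concrete, here is a minimal sketch in SQL; the orders and sales_fact tables and their columns are hypothetical, chosen only to illustrate the two workload styles.

    -- OLTP style: a short, indexed lookup touching a handful of rows.
    SELECT order_status
    FROM   orders
    WHERE  order_id = 1001;

    -- OLAP style: an aggregation that scans and groups large volumes of history.
    SELECT region, SUM(sales_dollar) AS total_sales
    FROM   sales_fact
    GROUP BY region;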

4) Data warehouse versus Database

A database is a system intended to organize, store, and retrieve large amounts of data easily.

A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.

5) There are three types of data warehouses:
1. Enterprise Data Warehouse - Provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - Has a broad, enterprise-wide scope, but unlike a true enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.
3. Data Mart - A subset of the data warehouse that supports a particular region, business unit, or business function.

6) Data Mart: A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data is derived from an enterprise-wide data warehouse. In an independent data mart, data is collected directly from the sources.

7) ETL Tools: Informatica, Ab Initio, Sybase ETL, SAS, Clover, Jasper, DataStage, Oracle Warehouse Builder...

8) Databases: Oracle, SQL Server, DB2 (IBM), Teradata (NCR Group), Access.

9) Front End Tools: Toad, SQL Developer (Oracle, SQL Server), Teradata SQL Assistant, TSO/SPUFI/Access (DB2)

10) Reporting Tools: Cognos, Business Objects (BO), MicroStrategy

11) What is a Schema?
A schema is the logical structure of a database: the named collection of tables, views, indexes, and relationships that defines how the data is organized, typically owned by a single database user.

12) What is Data Modeling?
Data modeling is the process of creating a data model by applying formal data model descriptions using data modeling techniques. (Organizing the data)
A data model is an abstract model that documents and organizes the business data for communication between team members, and is used as a plan for developing applications, specifically for how data is stored and accessed.

13)  Data Modeling Tools:
There are a number of data modeling tools that transform business requirements into a logical data model, and a logical data model into a physical data model. From the physical data model, these tools can generate the SQL code for creating the database.

Tool Name          Company Name
ERwin              Computer Associates
Embarcadero        Embarcadero Technologies
Rational Rose      IBM Corporation
PowerDesigner      Sybase Corporation
Oracle Designer    Oracle Corporation
Xcase              RESolution LTD.

14) Types of Data Modeling:
A data model instance may be one of three kinds:
  • Conceptual schema: (Entities) Describes the semantics of a domain, that domain being the scope of the model. A conceptual data model includes all major entities and relationships, does not contain much detail about attributes, and is often used in the INITIAL PLANNING PHASE.
  • Logical schema: (Attributes) Describes the structure of some domain of information: descriptions of tables and columns, object-oriented classes, XML tags, and the like. It is the refinement and extension of a conceptual data model. A logical data model represents the business requirements (entire or in part) of an organization and is developed before the physical data model.
  • Physical schema: Describes the physical means by which data are stored, and is concerned with partitions, CPUs, tablespaces, and the like. A physical data model includes all required tables, columns, relationships, and database properties for the physical implementation of the database. Database performance, indexing strategy, physical storage, and denormalization are important parameters of a physical model (see the sketch after this list).
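
A minimal sketch of physical-model concerns, in Oracle-flavored SQL; the table, the tablespace names (dw_data, dw_idx), and the partition bounds are assumptions made only for illustration.

    -- Physical design decisions: tablespace placement, range partitioning
    -- by date, and an index to speed up common queries.
    CREATE TABLE sales_fact (
        sale_date    DATE,
        product_key  INTEGER,
        sales_dollar NUMBER(12,2)
    )
    TABLESPACE dw_data
    PARTITION BY RANGE (sale_date) (
        PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
        PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01')
    );

    CREATE INDEX ix_sales_product ON sales_fact (product_key) TABLESPACE dw_idx;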
15) Normalization forms
Normalization is the process of organizing data to minimize redundancy. The goal of database normalization is to decompose relations with anomalies into smaller, well-structured relations. Normalization usually involves dividing large, badly formed tables into smaller, well-formed tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships (see the sketch after the table below).

The main normal forms are summarized below.
    * First normal form (1NF) - E.F. Codd (1970); a second version by C.J. Date (2003): The table faithfully represents a relation and has no repeating groups.
    * Second normal form (2NF) - E.F. Codd (1971): No non-prime attribute in the table is functionally dependent on a proper subset of a candidate key.
    * Third normal form (3NF) - E.F. Codd (1971); see also Carlo Zaniolo's equivalent but differently expressed definition (1982): Every non-prime attribute is non-transitively dependent on every candidate key in the table.
    * Boyce–Codd normal form (BCNF) - Raymond F. Boyce and E.F. Codd (1974): Every non-trivial functional dependency in the table is a dependency on a superkey.
    * Fourth normal form (4NF) - Ronald Fagin (1977): Every non-trivial multivalued dependency in the table is a dependency on a superkey.
    * Fifth normal form (5NF) - Ronald Fagin (1979): Every non-trivial join dependency in the table is implied by the superkeys of the table.
    * Domain/key normal form (DKNF) - Ronald Fagin (1981): Every constraint on the table is a logical consequence of the table's domain constraints and key constraints.
    * Sixth normal form (6NF) - C.J. Date, Hugh Darwen, and Nikos Lorentzos (2002): The table features no non-trivial join dependencies at all (with reference to the generalized join operator).
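
As a small illustration of normalization in SQL: suppose a single order_items table also stored the customer's name on every row, so the name depended on order_id alone rather than on the whole key. The decomposition below (hypothetical tables and columns, for illustration only) splits that into two well-formed tables linked by a key.

    -- Before: order_items(order_id, product_id, qty, customer_name)
    -- violates 2NF, since customer_name depends only on order_id.

    -- After: each fact is stored once and linked by keys.
    CREATE TABLE orders (
        order_id      INTEGER PRIMARY KEY,
        customer_name VARCHAR(100)
    );

    CREATE TABLE order_items (
        order_id   INTEGER REFERENCES orders(order_id),
        product_id INTEGER,
        qty        INTEGER,
        PRIMARY KEY (order_id, product_id)
    );

A change to a customer's name now happens in exactly one row of orders and is picked up everywhere through the relationship.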

16) A table name can carry an xref prefix to indicate that it is a cross-reference (junction) table that joins two other tables together via their primary keys.
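
For example, a minimal sketch of such a table, assuming hypothetical students and courses tables already exist:

    -- Junction table resolving a many-to-many relationship; its composite
    -- primary key is built from the two referenced primary keys.
    CREATE TABLE xref_student_course (
        student_id INTEGER REFERENCES students(student_id),
        course_id  INTEGER REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );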

17) Trigger:
A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. Triggers are mostly used to maintain the integrity of the information in the database. For example, when a new record (representing a new worker) is added to the employees table, matching records should also be created in the taxes, vacations, and salaries tables.
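
A minimal sketch of that example in Oracle PL/SQL; the employees, taxes, vacations, and salaries tables and their columns are assumptions made for illustration.

    CREATE OR REPLACE TRIGGER trg_new_employee
    AFTER INSERT ON employees
    FOR EACH ROW
    BEGIN
        -- Seed the related tables so payroll data stays consistent.
        INSERT INTO salaries  (emp_id, salary)    VALUES (:NEW.emp_id, 0);
        INSERT INTO vacations (emp_id, days_left) VALUES (:NEW.emp_id, 0);
        INSERT INTO taxes     (emp_id, tax_code)  VALUES (:NEW.emp_id, NULL);
    END;
    /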

18) Fact Table
The centralized table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts (measures) and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys.
E.g., "Sales Dollar" is a fact (measure), and it can be added across several dimensions. Fact tables store different types of measures: additive, semi-additive, and non-additive.

Measure types:
    * Additive - measures that can be added across all dimensions.
    * Semi-additive - measures that can be added across some dimensions but not others.
    * Non-additive - measures that cannot be added across any dimension.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). In the real world, it is also possible to have a fact table that contains no measures or facts at all; these are called factless fact tables (see the sketch below).
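
A minimal sketch of a factless fact table; the table and column names are hypothetical, and the referenced dimension tables are assumed to exist.

    -- Records an event (a student attending on a date) with no numeric
    -- measures: only dimension keys. Counts come from COUNT(*) queries.
    CREATE TABLE attendance_fact (
        student_key INTEGER REFERENCES student_dim(student_key),
        date_key    INTEGER REFERENCES date_dim(date_key),
        PRIMARY KEY (student_key, date_key)
    );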

Steps in designing a fact table (see the sketch after this list):
    * Identify a business process for analysis (e.g. sales).
    * Identify the measures or facts (sales dollar).
    * Identify the dimensions for the facts (product dimension, location dimension, time dimension, organization dimension).
    * List the columns that describe each dimension (region name, branch name, etc.).
    * Determine the lowest level of summary in the fact table (sales dollar).
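
A minimal star-schema sketch of the result, with one dimension shown in full and the fact table's primary key composed of its foreign keys; all table and column names are assumptions made for illustration.

    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        category     VARCHAR(50)
    );

    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY,
        cal_date DATE
    );

    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dim(product_key),
        time_key     INTEGER REFERENCES time_dim(time_key),
        sales_dollar NUMERIC(12,2),          -- the additive measure
        PRIMARY KEY (product_key, time_key)  -- composite key of the FKs
    );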

19) Dimensional Table:
A dimension table contains the descriptive (textual) attributes of the business, such as product, customer, time, or location, and is joined to the fact table through its primary key. Dimension attributes are what reports use to slice, filter, and group the measures.

20) Granularity: The lowest level of detail stored in the fact table (e.g. one row per individual transaction versus one row per day).

21) Star Schema v/s Snowflake Schema:
Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension tables and fact tables. It is called a star schema because the entity-relationship diagram between the dimensions and the fact table resembles a star, with one fact table connected to multiple dimensions. The center of the star schema consists of a large fact table, which points towards the dimension tables. The advantages of a star schema are easy slicing of the data, better query performance, and easy understanding of the data.

Steps in designing a star schema:
    * Identify a business process for analysis (e.g. sales).
    * Identify measures or facts (sales dollar).
    * Identify dimensions for the facts (product dimension, location dimension, time dimension, organization dimension).
    * List the columns that describe each dimension (region name, branch name, etc.).
A snowflake schema is a star schema structure that has been normalized through the use of outrigger tables, i.e. the dimension table hierarchies are broken into simpler tables. In the star schema example we had four dimensions (location, product, time, organization) and a fact table (sales).
In OLAP, the snowflake approach increases the number of joins and degrades the performance of data retrieval. Some organizations normalize the dimension tables to save space, but since dimension tables usually hold comparatively little data, the snowflake approach is often avoided.

Important aspects of the star schema and snowflake schema (see the sketch after this list):
In a star schema every dimension will have a primary key.
    * In a star schema, a dimension table will not have any parent table, whereas in a snowflake schema a dimension table will have one or more parent tables.
    * Hierarchies for the dimensions are stored in the dimension table itself in a star schema, whereas hierarchies are broken into separate tables in a snowflake schema. These hierarchies help to drill down the data from the topmost to the lowermost level.
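
A minimal sketch of snowflaking a product dimension, reusing the names from the earlier star-schema sketch (all names assumed for illustration): the category attribute is moved out of product_dim into its own outrigger (parent) table.

    CREATE TABLE category_dim (
        category_key  INTEGER PRIMARY KEY,
        category_name VARCHAR(50)
    );

    -- The snowflaked dimension now references its parent table instead of
    -- storing the category name on every product row.
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        category_key INTEGER REFERENCES category_dim(category_key)
    );

Queries against this schema need one extra join (product_dim to category_dim) to recover the category name, which is exactly the performance cost described above.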

22) Slowly Changing Dimensions

Dimensions that change over time are called slowly changing dimensions. For instance, a product's price changes over time; people change their names; country and state names may change. Slowly changing dimensions are commonly categorized into three types: Type 1, Type 2, and Type 3.
Type 1: Overwrite the old values. (NO history)
With Type 1, there is no way to find out the old value of the column, since the table now contains only the new data.

Type 2: Create an additional record. (FULL history)
With Type 2, the old values are not replaced; instead, a new row containing the new values is added to the table (sketched after this list). So at any point in time the difference between the old and new values can be retrieved and easily compared. This is very useful for reporting purposes.

Type 3: Create new fields. (PARTIAL history)
With Type 3, only the latest change to the values can be seen (oldest and newest only).
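
A minimal Type 2 sketch in Oracle-flavored SQL; the product_dim table, its columns, and the product_dim_seq sequence are assumptions made for illustration. Each product keeps its natural ID (product_id), but every version gets its own surrogate key (product_key) and validity window.

    -- Close out the current version of product 101...
    UPDATE product_dim
    SET    end_date     = CURRENT_DATE,
           current_flag = 'N'
    WHERE  product_id   = 101
    AND    current_flag = 'Y';

    -- ...then insert the new version as a fresh row.
    INSERT INTO product_dim (product_key, product_id, product_name, price,
                             start_date, end_date, current_flag)
    VALUES (product_dim_seq.NEXTVAL, 101, 'Widget', 12.99,
            CURRENT_DATE, DATE '9999-12-31', 'Y');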

* Design Methodologies:
Top-down (Bill Inmon): Data Warehouse --> Data Mart
Bottom-up (Ralph Kimball): Data Mart --> Data Warehouse