COMP810 Data Warehousing and Big Data
Assessment 1–Data Warehousing Project
To design, implement and analyse a Data Warehouse (DW) for MarketExpress, one of the biggest
online sales websites.
MarketExpress is one of the biggest online websites, selling products from various suppliers.
Suppliers’ products are kept in MarketExpress warehouses and orders (customer transactions) are
fulfilled from warehouses located all over New Zealand. MarketExpress has thousands of
customers and therefore it is important for the organisation to analyse the shopping behaviour of
their customers. Based on such analysis the organisation can optimise their selling techniques e.g.
by having relevant promotions on different products.
Assessment Part 1. Build a DW
You will build a DW using the star schema which is a data modelling technique that is used to map
multidimensional decision support data into a relational DW. Star schema is an easily
implemented model for multidimensional data analysis while still preserving the relational
structures on which the operational database is built.
The star schema represents aggregated data for specific business activities. Using this schema, one
can store aggregated data from multiple sources that will represent different aspects of business
operations. For example, the aggregation may involve total sales by selected time periods, by
products, by warehouses, and so on. Aggregated totals can be the total number of product sold,
total sales values by products, and so on. The basic star schema has three main components: facts,
dimensions, attributes. Usually in case of star-schema for sales the dimension tables are: product,
date, warehouse, and supplier while the fact table is sales. However, to determine the right tables
and attributes you will need to consider Figure 2, and build a DW with the appropriate fact table
and dimension tables, each with appropriate attributes.
The assessment provides a script file named “DataStream_MasterData_Creator.sql”. By executing
the script it will create two tables in your account. One is DATASTREAM table with 10,000 records
populated in it. This data will be generated randomly based on 100 products, 50 customers, 10
warehouses, and one year time period as a date – from 01-Jan-19 to 31-Dec-19. The values for
the quantity attribute will be random between 1 and 10. The other is MASTERDATA table with 100
records in it. The structure of both tables with their attribute names and data types is given in
Figure 4. The attributes DATASTREAM_ID and PRODUCT_ID are primary keys in DATASTREAM and
MASTERDATA tables respectively.
Assessment Part 2. Enrich Customer Transaction Data and Store in DW
To implement this enrichment feature in the transformation phase of ETL we need a join operator
typically called Semi-Stream Join (SSJ). There are a number of algorithms available to implement
this join operation however, the simplest one is Index Nested Loop Join (INLJ) which is explained in
the next section and you are required to implement it in this project.
Index Nested Loop Join (INLJ)
The INLJ is a traditional join operator to implement the join operation between DS and MD. In
INLJ, DS is scanned tuple by tuple and based on this, the disk-based MD is accessed using an index
on the join attribute. A graphical overview of an INLJ is shown in Figure 3 where DS tuple si is
joined with MD tuple rj.
The crux of INLJ is that with every loop step, the algorithm reads a chunk of DS tuples and joins
them one-by-one with the relevant tuples in MD. To access the relevant tuple from MD the
algorithm uses an index on the join attribute in MD. For example, for the scenario presented in
Figure 2 the join attribute is PRODUCT_ID which should be considered as the Primary Key (PK) in
MD while a Foreign Key (FK) in DS and therefore, the join would be between PK and FK.
本网站支持 Alipay WeChatPay PayPal等支付方式
E-mail: firstname.lastname@example.org 微信号:vipnxx