Data Mart, Data Warehouse, Data Lake

박휴지 : Park Tissue 2022. 4. 5. 06:44

RESTful APIs

automate data pipelines using Hadoop/Spark

SaaS

Data Lake

DL is a large repository of raw data, either unstructured or semi-structured. This data is aggregated from various sources and is simply stored. It is not altered to suit a specific purpose or fit into a particular format.

Preparing this data for analysis is time-consuming: it has to be cleansed and reformatted for uniformity first. Data lakes are great resources for municipalities or other organizations that store information related to outages, traffic, crime or demographics. A data lake is typically more cost-effective to implement and maintain because it is largely unstructured.
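As a rough illustration of "simply stored, shaped later", here is a minimal sketch that lands payloads untouched and applies structure only when reading (an approach often called schema-on-read, a term these notes do not use). The folder layout, file names, and field names (`area`, `district`, `minutes`, `duration_min`) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, source: str, name: str, payload: str) -> Path:
    """Store a payload exactly as received, under a per-source folder."""
    target = lake_dir / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def read_outages(lake_dir: Path) -> list[dict]:
    """Apply structure only at read time, not when the data is stored."""
    rows = []
    for path in sorted((lake_dir / "outages").glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # Sources disagree on field names, so normalize here.
        rows.append({
            "area": record.get("area") or record.get("district"),
            "minutes": int(record.get("minutes") or record.get("duration_min")),
        })
    return rows


# Hypothetical payloads from three different feeds, stored without changes.
lake = Path(tempfile.mkdtemp())
land_raw(lake, "outages", "a.json", '{"area": "north", "minutes": 42}')
land_raw(lake, "outages", "b.json", '{"district": "south", "duration_min": "15"}')
land_raw(lake, "traffic", "cam1.csv", "time,count\n08:00,120\n")

outages = read_outages(lake)
```

Writing is cheap because nothing is validated or converted; the cleanup cost shows up in `read_outages`, which is where the time-consuming preparation mentioned above would live.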

Data warehouse

DW is an aggregation of data from many sources into a single, centralized repository that unifies data quality and format, making it useful for data scientists in data mining, AI, machine learning, and business intelligence.

Data warehousing could be used by a large city to aggregate electronic transactions from various departments, including speeding tickets, dog licenses, excise tax payments and other transactions. This structured data would be analyzed by the city to issue follow-up invoicing and to update census data and police logs. It could also be used by a developer to aggregate terabytes of data generated by sensors on automobiles to aid in the decision-making process for an autonomous driving solution.

Data Mart

DM is a subset of a data warehouse that benefits a specific set of users within the business or business unit. A data mart could be used by the marketing department of a manufacturing company to determine the ideal target demographic or persona to aid in the development of marketing plans. It could also be used by a manufacturing department to analyze performance and error rates to enable continuous improvement. Data sets within a data mart are often utilized in real time, for current analysis and actionable results.
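To make "a subset of a data warehouse" concrete, here is a small sketch using Python's built-in `sqlite3` module. It models the mart as a SQL view over a warehouse table, which is a simplification: a real data mart is often a separate physical store. The table, view, and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical warehouse table covering the whole business.
    CREATE TABLE warehouse_orders (
        order_id INTEGER PRIMARY KEY,
        region TEXT, age_group TEXT, product TEXT, amount REAL
    );
    INSERT INTO warehouse_orders VALUES
        (1, 'east', '18-29', 'widget', 20.0),
        (2, 'east', '30-44', 'widget', 35.0),
        (3, 'west', '18-29', 'gadget', 50.0),
        (4, 'west', '18-29', 'widget', 10.0);

    -- The "mart": only the columns and aggregation marketing cares about.
    CREATE VIEW marketing_mart AS
        SELECT age_group, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM warehouse_orders
        GROUP BY age_group;
""")

# Marketing queries the mart, not the full warehouse table.
mart_rows = conn.execute(
    "SELECT age_group, orders, revenue FROM marketing_mart ORDER BY revenue DESC"
).fetchall()
```

With this sample data the marketing team sees two rows, with the 18-29 group first at 3 orders and 80.0 in revenue, without ever touching order-level detail.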

Details of DW (data warehouse)

I would like to think of a data warehouse as being more purpose-specific than a data lake.

A data lake is a great place to quickly dump all sorts of raw data, structured and unstructured, to clean and organize later. A data warehouse, on the other hand, is a large collection of organized, clean business data, ready to help an organization make decisions. And a data mart is like a subset of a data warehouse that's more specific to a particular business domain.

A data warehouse serves as the single source of truth for an organization across multiple knowledge domains. Its data comes from multiple source systems and is transformed from raw data into high-quality data, optimized for analytics, by ETL ("Extract, Transform and Load") tools.

The source systems can be of different types, such as transactional systems or relational databases, and they can cover a wide variety of business domains.

So the data could cover things like customer data from our CRMs, sales data, and so on.

Once data has been cleaned, transformed and loaded into our data warehouse, it's ready to expose to our users, who can then run analytics and machine learning on these data sets.
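The extract, transform, and load steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the CRM and sales rows are invented, the cleaning rules are arbitrary examples, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Extract: hypothetical rows as they might arrive from a CRM and a sales system.
crm_rows = [
    {"id": "C1", "name": " Alice ", "country": "kr"},
    {"id": "C2", "name": "Bob", "country": "US"},
    {"id": "C2", "name": "Bob", "country": "US"},  # duplicate record
]
sales_rows = [
    {"customer": "C1", "amount": "120.50"},
    {"customer": "C2", "amount": "80"},
    {"customer": "C1", "amount": "n/a"},  # dirty value
]


def transform_customers(rows):
    """Trim names, upper-case country codes, and drop duplicate ids."""
    seen, clean = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        clean.append((row["id"], row["name"].strip(), row["country"].upper()))
    return clean


def transform_sales(rows):
    """Convert amounts to float and skip rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append((row["customer"], float(row["amount"])))
        except ValueError:
            continue
    return clean


# Load: write the cleaned rows into warehouse tables.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, country TEXT)")
warehouse.execute("CREATE TABLE sales (customer_id TEXT, amount REAL)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?, ?)", transform_customers(crm_rows))
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transform_sales(sales_rows))
warehouse.commit()

# Analysts can now query one consistent store.
report = warehouse.execute(
    """SELECT c.name, c.country, SUM(s.amount)
       FROM customers c JOIN sales s ON s.customer_id = c.id
       GROUP BY c.id ORDER BY c.id"""
).fetchall()
```

The final query joins data that started out in two separate systems with inconsistent formatting, which is the point of doing the transform step before loading.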

Our users can be folks like business analysts, data scientists, and data engineers.

These folks can now start leveraging these data sets, either using the built-in analytics tools in the data warehouse or using a variety of business intelligence, predictive analytics, and machine learning platforms.

There are three common ways to deploy a data warehouse.

1. On-prem

The first way is on-prem: the data warehouse runs on commodity hardware. It can be structured using an MPP ("Massively Parallel Processing") architecture, where we just add more compute nodes as the workload grows, or an SMP ("Symmetric Multi-Processing") architecture, where a tightly coupled multi-CPU system shares resources under one common operating system.

2. Purpose-built appliance format

This is typically an integrated stack of CPU, memory, storage, and software, all purpose-built and optimized for a data warehouse workload by a single vendor.

3. Cloud-based data warehouse

Here the data warehouse is delivered as a managed SaaS offering by one of the public cloud providers. Moving the data warehouse to the cloud is the next frontier for a lot of enterprises, and for valid reasons. One benefit is freeing up resources to focus on higher-value analytics tasks instead of just managing systems. Another is the ability to scale easily. We also get automatic upgrades. On the downside, a cloud-based warehouse can take a performance hit depending on how it's tuned for a specific workload, and costs can be unexpectedly high because of how a cloud data warehouse scales.
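The MPP idea from the on-prem option, adding compute nodes as the workload grows, can be illustrated with a toy simulation. Threads stand in for nodes here, and the node count, sample rows, and the byte-sum "hash" are all assumptions made for the sketch; a real MPP warehouse distributes data and queries across separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 3  # hypothetical node count; MPP scales by raising this


def partition(rows, num_nodes):
    """Assign each row to a node by its key, so nodes share no data."""
    shards = [[] for _ in range(num_nodes)]
    for key, amount in rows:
        # The sum of byte values is a stable stand-in for a real hash function.
        shards[sum(key.encode()) % num_nodes].append((key, amount))
    return shards


def local_aggregate(shard):
    """Work a single node does on its own shard only."""
    totals = Counter()
    for key, amount in shard:
        totals[key] += amount
    return totals


def mpp_sum(rows, num_nodes=NUM_NODES):
    """Fan the shards out to the nodes, then merge their partial totals."""
    shards = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, shards))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # coordinator adds up the partial results
    return dict(merged)


# Made-up city transactions: (category, amount).
rows = [("tickets", 50), ("licenses", 20), ("tickets", 70), ("taxes", 300), ("licenses", 5)]
totals = mpp_sum(rows)
```

The totals are the same whether the rows are spread over one node or five, which is what lets an MPP system grow by adding nodes. An SMP system would instead run `local_aggregate` over all rows inside one machine with shared memory.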

---------------------------------------------------------------------------------------------------------------

What is a Data Warehouse?
A data warehouse is the core analytics system of an organization. The data warehouse will frequently work in conjunction with an operational data store (ODS) to ‘warehouse’ data captured by the various databases used by the business. For example, suppose a company has databases supporting POS, online activity, customer data, and HR data. In that case, the data warehouse will take the data from these sources and make it available in a single location. The ODS will typically handle the process of cleaning and normalizing the data, preparing it for storage in the data warehouse.

The method of extracting data from the database, transforming it in the ODS, and loading it into the data warehouse is an example of the extract-transform-load (ETL) process, or the similar ELT process.

Because a data warehouse captures transformed (i.e. cleaned) historical data, it is an ideal tool for data analysis. Because business units will leverage the warehouse data to create reports and perform data analysis, business units are frequently involved in how the data is organized. Like a relational database, it typically uses SQL to query the data, and it uses tables, indexes, keys, views, and data types for data organization and integrity.

While a database can act as a pseudo data warehouse through the implementation of views, it is considered best practice to use a data warehouse for business user interaction, leaving databases to capture transactional data. Because the chief intent is analytics, a data warehouse is used for online analytical processing (OLAP). OLAP is the core business of Zuar, the source of this passage, whose Mitto product lets companies automate their ETL/ELT processes.

Main Characteristics of a Data Warehouse
1. Stores large quantities of historical data so old data is not erased when new data is updated
2. Captures data from multiple, disparate databases
3. Works with ODS to house normalized, cleaned data
4. Organized by subject
5. OLAP (online analytical processing) application
6. The primary data source for data analytics
7. Reports and dashboards use data from data warehouses
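Characteristic 1 in the list above, keeping old data when new data arrives, can be implemented in several ways, and the list does not say which one is meant. The sketch below shows one common technique, effective-dated rows (often called a type 2 slowly changing dimension). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_history ("
    " customer_id TEXT, city TEXT, valid_from TEXT, valid_to TEXT)"
)


def record_change(customer_id, city, as_of):
    """Close the current row (if any) and append a new one; old values stay."""
    conn.execute(
        "UPDATE customer_history SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (as_of, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?, NULL)",
        (customer_id, city, as_of),
    )


def city_on(customer_id, day):
    """Look up the value that was current on a given ISO date."""
    row = conn.execute(
        "SELECT city FROM customer_history "
        "WHERE customer_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (customer_id, day, day),
    ).fetchone()
    return row[0] if row else None


record_change("C1", "Seoul", "2021-01-01")
record_change("C1", "Busan", "2022-03-15")

history = conn.execute(
    "SELECT city, valid_from, valid_to FROM customer_history ORDER BY valid_from"
).fetchall()
```

After the second change both rows are still in the table, so `city_on("C1", "2021-06-01")` returns the old city and `city_on("C1", "2022-04-05")` returns the new one. An operational database would typically just overwrite the city in place.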

---------------------------------------------------------------------------------------------------------------

-Korean-

Data Mart
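- A DM is a space where the business staff who work with the data keep it, as-is, for their own use.
- Data is stored and used at a detailed granularity that matches the work units of each business team.
- In other words, a DM is a space where business staff can pick out and consume the data they need themselves.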

Data Warehouse
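- A DW is like a data wholesaler: it collects data from various sources and stores it by subject, to be supplied to the data retailers (DMs).
- It is where data from various sources is integrated and stored before it is delivered to consumers.
- Data from different sources raises problems such as differing structures (schemas) and differing terminology, depending on the environment where it originated. A DW is designed and built with these problems in mind, from an enterprise-wide perspective, as a structure that integrates and stores this varied data.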

Data Lake
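- Unlike a DW, which refines various sources into a single integrated form, a DL brings in data from various sources as-is and stores it, preserving its variety.
- In other words, a DL is a space that stores data from its sources in one place, in whatever form it was generated.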