Updates for the Bootcamp lessons
tuhaihe committed Nov 10, 2023
1 parent c22139c commit 92bcf13
Showing 15 changed files with 154 additions and 131 deletions.
2 changes: 1 addition & 1 deletion 000-cbdb-sandbox/run.sh
100644 → 100755
@@ -2,4 +2,4 @@

docker build -f cbdb_Dockerfile -t cbdb:centos7 .
#! docker run -ti -d -v /sys/fs/cgroup:/sys/fs/cgroup:ro -p 22:22 -p 5432:5432 -h mdw cbdb:centos8
docker run -ti -d -v /sys/fs/cgroup:/sys/fs/cgroup:ro -p 22:22 -p 5432:5432 -h mdw cbdb:centos7
docker run -ti -d -v /sys/fs/cgroup:/sys/fs/cgroup:ro -p 122:22 -p 15432:5432 -h mdw cbdb:centos7
32 changes: 0 additions & 32 deletions 101-cbdb-tutorials/101-0-backgroud-database-concepts.md

This file was deleted.

@@ -0,0 +1,74 @@
---
title: "Introduction to Database and CloudberryDB Architecture"
---

# Lesson 0: Introduction to Database and CloudberryDB Architecture

## Background: Database Concepts

Before starting these tutorials, spend some time getting familiar with how single-instance databases work. If you already have some knowledge of and experience with Oracle, MySQL, or especially Postgres, that is a great start.

Relational databases are software systems that store, manage, and process data. They are usually built on the client/server model: the database runs as a server, and multiple clients can connect to it to read or update the data.

Clients usually access the data using SQL (or some dialect of the SQL language specification). Clients come in different implementations, such as proprietary client libraries or ODBC/JDBC-compliant drivers.

Database data is usually stored in objects called tables. Tables have a predefined structure (columns) and contain zero or more rows.

Tables can be grouped in logical entities called 'schemas' (or namespaces).

Tables and schemas live inside a 'database' entity. Some database software supports multiple databases per instance (MySQL, Postgres), while other software supports one database per instance (Oracle).

Along with tables there are supporting objects such as indexes, sequences, views, etc.

The database system needs to maintain some metadata, called the database catalog. The database catalog contains information about the data objects and supporting objects, as well as anything else that needs to be stored at the system level (user authentication, etc.).
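The concepts above (tables, rows, supporting objects, and the catalog) can be tried out with a minimal sketch. It assumes SQLite, which ships with Python, as a stand-in for a server database; the `flights` table, its columns, and the index name are hypothetical and chosen only for illustration.

```python
import sqlite3

# SQLite (bundled with Python) stands in for a server database here.
conn = sqlite3.connect(":memory:")

# A table has a predefined structure (columns) and zero or more rows.
conn.execute("CREATE TABLE flights (carrier TEXT, origin TEXT, delay_min INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("AA", "SFO", 12), ("UA", "SFO", 0), ("DL", "ATL", 45)],
)

# A supporting object: an index on one column.
conn.execute("CREATE INDEX idx_flights_origin ON flights (origin)")

# The catalog is itself queryable. In SQLite it is the sqlite_master table;
# PostgreSQL-family systems expose theirs through the pg_catalog schema.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()
print(catalog)
```

The catalog query returns one entry for the table and one for the index, which shows that the database tracks its own objects as ordinary queryable metadata.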

SQL (Structured Query Language) is a descriptive (declarative) language, not an imperative one: it describes what the user needs, not how to get it. Once the user describes what they need, the database must decide how to get it. This process is called query optimization. Its end result is a query plan, which is a step-by-step set of instructions for producing the result.
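To make the split between "what" and "how" concrete, here is a small sketch that asks an optimizer for its plan. It again assumes SQLite as a stand-in (its command is `EXPLAIN QUERY PLAN`; PostgreSQL-family databases use `EXPLAIN`), and the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, origin TEXT, delay_min INTEGER)")
conn.execute("CREATE INDEX idx_flights_origin ON flights (origin)")

# The query states WHAT is wanted: carriers of flights departing from SFO.
query = "SELECT carrier FROM flights WHERE origin = 'SFO'"

# EXPLAIN QUERY PLAN asks SQLite's optimizer HOW it would run the query.
# The last column of each returned row is a human-readable plan step.
plan_steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan_steps:
    print(step)
```

The query never mentions the index, yet the plan refers to it: choosing the access path is the optimizer's job, not the user's.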

## Introduction to the Cloudberry Database Architecture

Cloudberry Database is a massively parallel processing (MPP) database server with an architecture specially designed to manage large-scale analytic data warehouses and business intelligence workloads.

MPP (also known as a shared-nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each processor with its own memory, operating system, and disks. Cloudberry uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses, and can use all of a system's resources in parallel to process a query.

Cloudberry Database is based on the open-source PostgreSQL database. It is essentially several PostgreSQL database instances working together as one cohesive database management system (DBMS). It is based on the PostgreSQL 14.4 kernel, and in most cases it is very similar to PostgreSQL. Database users interact with Cloudberry Database as they would with a regular PostgreSQL DBMS.

In CloudberryDB, the internals of PostgreSQL have been modified and optimized to support the parallel structure of Cloudberry Database. For instance, the system catalog, optimizer, query executor, and transaction manager components have been modified and enhanced to execute queries simultaneously across the parallel PostgreSQL database instances. The CloudberryDB interconnect (the networking layer) enables communication between the distinct PostgreSQL instances and allows the system to behave as one logical database.

Cloudberry Database also includes features designed to optimize PostgreSQL for business intelligence (BI) workloads. For example, CloudberryDB adds parallel data loading (external tables), resource management, query optimizations, and storage enhancements.

_Figure 1. High-Level Cloudberry Database Architecture_

![High-Level Cloudberry Database Architecture](../images/highlevel_arch.jpg)

The following topics describe the components that make up a Cloudberry Database system and how they work together.

### CloudberryDB Master (Coordinator)

The Cloudberry Database master is the entry point to the Cloudberry Database system. It accepts client connections, handles SQL queries, and then distributes the workload to the segment instances.

Cloudberry Database end users interact with Cloudberry Database only through the master node, as they would with a typical PostgreSQL database. They connect to the database using clients such as psql or drivers such as JDBC or ODBC.

The master stores the global system catalog, a set of system tables that contain metadata about Cloudberry Database itself. The master node does not contain any user table data; user table data resides only on the segments. The master node authenticates client connections, processes incoming SQL commands, distributes workloads among the segments, collects the results returned by each segment, and returns the final results to the client.
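The dispatch-and-collect role of the master can be illustrated with a toy scatter-gather sketch. This is not how Cloudberry Database is implemented: the segments are plain Python lists, threads stand in for separate database instances, and all names and values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of a "delay" column held
# by one segment. The coordinator holds no user rows itself.
segments = [
    [12, 0, 45],
    [7, 30],
    [5, 5, 5, 20],
]

def segment_partial(rows):
    """Work done on one segment: a partial (sum, count) over its own rows."""
    return sum(rows), len(rows)

def coordinator_average(segments):
    """Dispatch the partial aggregate to every segment, then combine."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_partial, segments))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(coordinator_average(segments))
```

Each segment returns only a small partial result, so the coordinator combines a few numbers instead of scanning every row itself.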

### CloudberryDB Segments

Cloudberry Database segment instances are independent PostgreSQL databases, each of which stores a portion of the data and performs the majority of the query execution work.

When a user connects to the database via the Cloudberry master and issues a query, the corresponding execution plan is distributed to each segment instance. For more information about query processes, see About Cloudberry Query Processing.

A server that runs segments is called a segment host. A segment host usually runs two to eight Cloudberry segments; the number depends on several factors, such as CPU cores, memory, disks, network interfaces, and workloads. To get better performance from Cloudberry Database, distribute data and workloads evenly across segments so that all segments finish their part of an execution plan at about the same time and no single segment becomes a bottleneck.
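Why even distribution matters can be seen in a small sketch that assigns rows to segments by hashing a key. The CRC32 hash, the segment count, and the key values are assumptions made for illustration; Cloudberry Database uses its own internal hashing.

```python
import zlib
from collections import Counter

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment with a stable hash.

    CRC32 is used only because it is deterministic across runs; it is not
    the hash function Cloudberry Database uses internally.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def rows_per_segment(keys, num_segments):
    """Count how many rows land on each segment for the given key values."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(seg, 0) for seg in range(num_segments)]

num_segments = 4
# A high-cardinality key (hypothetical flight IDs) spreads rows out...
good = rows_per_segment(range(10_000), num_segments)
# ...while a key with a single value places every row on one segment.
skewed = rows_per_segment(["AA"] * 10_000, num_segments)

print(good, skewed)
```

In the skewed case one segment does all the work while the others sit idle, so the query is only as fast as that single segment.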

### CloudberryDB Interconnect

The interconnect is the networking layer of the Cloudberry Database architecture.

The interconnect is the inter-process communication mechanism between segments. By default, the interconnect uses the User Datagram Protocol (UDP) to send and receive messages over the network. The interconnect provides its own datagram verification and retransmission mechanism, so its reliability is equivalent to that of the Transmission Control Protocol (TCP), while its performance and scalability exceed those of TCP. If you choose TCP for the interconnect, Cloudberry Database is limited to around 1000 segment instances. With the default UDP-based interconnect, this limit does not exist.
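The general idea of adding reliability on top of an unreliable datagram channel can be sketched as a simulation. This is not the interconnect's actual protocol: it is a simple stop-and-wait scheme, the lossy channel is a hypothetical function rather than a real UDP socket, and all names are illustrative.

```python
def send_reliably(packets, channel, max_attempts=5):
    """Stop-and-wait delivery: resend each packet until it is acknowledged."""
    delivered = []
    transmissions = 0
    for seq, payload in enumerate(packets):
        for _attempt in range(max_attempts):
            transmissions += 1
            if channel(seq, payload):  # True means an ACK came back
                delivered.append(payload)
                break
        else:
            raise TimeoutError(f"packet {seq} was never acknowledged")
    return delivered, transmissions

def make_lossy_channel(drop_every):
    """Hypothetical channel that silently drops every Nth datagram."""
    state = {"sent": 0}

    def channel(seq, payload):
        state["sent"] += 1
        return state["sent"] % drop_every != 0

    return channel

delivered, transmissions = send_reliably(
    ["row-1", "row-2", "row-3", "row-4"], make_lossy_channel(drop_every=3)
)
print(delivered, transmissions)
```

All four payloads arrive in order even though one datagram was dropped; the cost is one extra transmission. Handling loss in the application layer like this is what lets a protocol keep UDP's low overhead while still delivering every message.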

Now you can start the following lessons by clicking on the links:

- [Lesson 1: Create Users and Roles](../101-cbdb-tutorials/101-1-create-users-and-roles.md)
- [Lesson 2: Create and Prepare Database](../101-cbdb-tutorials/101-2-create-and-prepare-database.md)
- [Lesson 3: Create Tables](../101-cbdb-tutorials/101-3-create-tables.md)
- [Lesson 4: Data Loading](../101-cbdb-tutorials/101-4-data-loading.md)
- [Lesson 5: Queries and Performance Tuning](../101-cbdb-tutorials/101-5-queries-and-performance-tuning.md)
- [Lesson 6: Backup and Recovery Operations](../101-cbdb-tutorials/101-6-backup-and-recovery-operations.md)
3 changes: 3 additions & 0 deletions 101-cbdb-tutorials/101-3-create-tables.md
@@ -6,6 +6,9 @@ title: "101 - Lesson 3: Create Tables"

After creating and preparing a database in [Lesson 2: Create and Prepare a Database](../101-cbdb-tutorials/101-2-create-and-prepare-database.md), you can start to create tables in the database.

> [!Note]
> To introduce Cloudberry Database, we use a public data set, the Airline On-Time Statistics and Delay Causes data set, published by the United States Department of Transportation at http://www.transtats.bts.gov/. The On-Time Performance data set records flights by date, airline, originating airport, destination airport, and many other flight details. Data is available for flights since 1987. The exercises in this guide use data for about a million flights in 2009 and 2010. You are encouraged to review the SQL scripts in the `000-cbdb-sandbox/configs/faa.tar.gz` archive as you work through this introduction. You can run most of the exercises by entering the commands yourself or by executing a script in the faa directory.

## Create tables using a SQL file in psql

In Cloudberry Database, you can use the `CREATE TABLE` SQL statement to create a table.
4 changes: 2 additions & 2 deletions 101-cbdb-tutorials/README.md
@@ -6,11 +6,11 @@ title: "Cloudberry Database Tutorials Based on Single-Node Installation"

This folder contains a series of tutorials for quickly trying out Cloudberry Database based on the single-node installation.

Before starting to read the tutorials, you are expected to finish installing the single-node Cloudberry Database by following [Install a Single-Node Cloudberry Database](../000-cbdb-sandbox/README.md). In addition, it is a good idea to know more about Cloudberry Database. It is recommended to first read [Cloudberry Database Overview](./introduction-to-cloudberrydb-in-database-analytics.md) and [Cloudberry Database Architecture](./introduction-to-the-cloudberry-database-architecture.md).
Before starting to read the tutorials, you are expected to finish installing the single-node Cloudberry Database by following [Install a Single-Node Cloudberry Database](../000-cbdb-sandbox/README.md). In addition, it is a good idea to know more about Cloudberry Database. It is recommended to first learn about the basic [database concepts and CloudberryDB Architecture](./101-0-introduction-to-database-and-cloudberrydb-architecture.md).

The series includes the following tutorials. Follow them in sequence.

- [Lesson 0: Background Concepts of Databases](../101-cbdb-tutorials/101-0-backgroud-database-concepts.md)
- [Lesson 0: Introduction to Database and CloudberryDB Architecture](../101-cbdb-tutorials/101-0-introduction-to-database-and-cloudberrydb-architecture.md)
- [Lesson 1: Create Users and Roles](../101-cbdb-tutorials/101-1-create-users-and-roles.md)
- [Lesson 2: Create and Prepare Database](../101-cbdb-tutorials/101-2-create-and-prepare-database.md)
- [Lesson 3: Create Tables](../101-cbdb-tutorials/101-3-create-tables.md)

This file was deleted.

2 changes: 1 addition & 1 deletion 102-cbdb-crash-course/README.md
@@ -4,7 +4,7 @@ title: Cloudberry Database Crash Course

# Cloudberry Database Crash Course

This crash course provides an extensive overview of Cloudberry Database (CBDB), an open-source Massively Parallel Processing (MPP) database. It covers key concepts, features, utilities, and hands-on exercises to become proficient with CBDB.
This crash course provides an extensive overview of Cloudberry Database, an open-source Massively Parallel Processing (MPP) database. It covers key concepts, features, utilities, and hands-on exercises to become proficient with CBDB.

Topics include:

12 changes: 7 additions & 5 deletions 103-cbdb-performance-benchmark/README.md
@@ -1,8 +1,10 @@
<!-- ![Cloudberry](../images/cloudberrydb_logo.png) -->
# CloudberryDB Performance Benchmark

This tutorial will show you how to do the CloudberryDB performance by the CloudberryDB sandbox docker image.
This tutorial will show you how to perform a CloudberryDB performance benchmark in the CloudberryDB Sandbox Docker image. The benchmark process consists of two parts:

Including two parts, you can click to see more details:
- [Part 1: TPCH benchmark](../103-cbdb-performance-benchmark/tpch.md), which is based on the benchmark tool `TPC-H`.
- [Part 2: TPCDS benchmark](../103-cbdb-performance-benchmark/tpcds.md), which is based on the benchmark tool `Greenplum Database TPC-DS`.
- [Part 1: TPC-H benchmark](../103-cbdb-performance-benchmark/tpch.md)
- [Part 2: TPC-DS benchmark](../103-cbdb-performance-benchmark/tpcds.md)

These benchmarks are designed to simulate real-world scenarios and measure the performance of decision support systems under various conditions.

By completing this tutorial, you will gain a comprehensive understanding of CloudberryDB's performance capabilities and how to effectively benchmark its performance using industry-standard tools and techniques.