Debugging Spread


VERTICA/99.Best Practices 2017. 5. 10. 16:35

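What Is Spread?

Vertica uses Spread, an open-source toolkit, to provide a high-performance messaging service that is resilient to network faults.
The Spread daemons start automatically when you start the database for the first time.
The Spread daemons run on the control nodes of the cluster, and the control nodes manage message communication.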

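Vertica Process and Spread Daemon Pairs

When you install Vertica, the Spread daemon is installed with your database.
The Vertica process on a node communicates with the Spread daemon through a domain socket.
Communication between nodes uses two channels: the data channel and the control channel (the UDP main channel and the UDP token channel).
The following image shows a four-node cluster with the domain sockets and the two channels.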

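The Vertica process and the Spread daemon on the same node connect to each other through a domain socket.
The Vertica process on one node connects to the Vertica processes on other nodes through TCP sockets.

The Spread daemon on one node connects to the Spread daemons on other nodes through the UDP token and main channels.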

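How Vertica and Spread Work Together

As long as all nodes are healthy, communication between Vertica and Spread runs smoothly.

The Spread daemons communicate with each other by exchanging UDP packets over two UDP channels (ports):

Main channel - The Spread daemon sends Spread-related control messages to the other daemons,
along with Vertica-related control messages that originate from the Vertica server.
Token channel - A special message called the Spread token is passed between the Spread daemons.
The token is important for membership and fault tolerance, and for keeping control messages in a consistent order.

If the Spread daemon on one node stops communicating with the other Spread daemons in the cluster,
the Spread daemons remove that node from the cluster membership.
They wait 8 seconds before removing the node from the membership.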

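When Vertica and Spread Cannot Communicate

Communication between the Spread daemons and Vertica can be interrupted in the following situations.

Spread Token Timeout

The Spread daemons monitor one another using a token mechanism.
The token is a special Spread message that is passed over UDP (User Datagram Protocol)
between the nodes in the current active membership.
The token confirms that the Spread daemon on every node is active.
The token timeout is 8 seconds.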
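
As a rough illustration of how a timeout-based membership check behaves, the following Python sketch records when each node was last heard from and drops nodes that stay silent for longer than 8 seconds. The class and method names (TokenMonitor, record_token, expired_members, prune) are hypothetical, and this is not how Spread itself is implemented; only the 8-second timeout comes from the text above.

```python
TOKEN_TIMEOUT_SECONDS = 8.0  # timeout described in the text


class TokenMonitor:
    """Toy model of timeout-based membership (not Spread's real protocol)."""

    def __init__(self, members, now=0.0):
        # Remember the last time each member was heard from.
        self.last_seen = {name: now for name in members}

    def record_token(self, member, now):
        """Note that `member` passed the token at time `now` (seconds)."""
        self.last_seen[member] = now

    def expired_members(self, now):
        """Return members silent for longer than the timeout, sorted."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > TOKEN_TIMEOUT_SECONDS
        )

    def prune(self, now):
        """Drop expired members and return the names that were removed."""
        removed = self.expired_members(now)
        for name in removed:
            del self.last_seen[name]
        return removed


monitor = TokenMonitor(["node1", "node2", "node3"], now=0.0)
monitor.record_token("node1", now=5.0)
monitor.record_token("node2", now=6.0)
# node3 has been silent since t=0, so at t=9 it exceeds the 8-second timeout.
removed = monitor.prune(now=9.0)
print(removed)                    # ['node3']
print(sorted(monitor.last_seen))  # ['node1', 'node2']
```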

Vertica Process Killed Abruptly with kill -9

The Vertica process writes PROBE messages to the Spread daemon.
However, if the Vertica process crashes or is killed abruptly with kill -9,
it can neither send a PROBE message to the remaining nodes nor call the Spread API SP_disconnect.

If the Spread daemon of the killed Vertica process is still running,
the Spread daemon receives a message about the killed Vertica process through the TCP socket connection.
The Spread daemon then informs the other nodes in the cluster that the Vertica process has been removed from the current membership.

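Cluster Shutdown

If a user kills a Spread daemon,
the Spread daemons on the other nodes in the cluster detect the killed daemon through the token timeout.
For example, assume an eight-node cluster with a K-safety value of 1.
If four nodes go down, or if two buddy nodes go down, the cluster shuts down.
When the cluster shuts down, the Vertica process
invokes the membership protocol so that the remaining Spread daemons can form a new membership.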
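
The eight-node example can be sketched in Python. The function below is an illustration only. It assumes that a K-safety 1 cluster shuts down when half or more of its nodes are down, or when a node and its buddy are both down, and it assumes for simplicity that each node's buddy is the next node in a ring. The function name is hypothetical, and the actual buddy layout depends on the projections in your database.

```python
def cluster_is_down(total_nodes, down_nodes):
    """Return True if a K-safety 1 cluster would shut down (simplified model).

    Nodes are numbered 0..total_nodes-1. Assumption: node i's buddy is
    node (i + 1) % total_nodes, i.e. buddies are neighbours in a ring.
    """
    down = set(down_nodes)
    # Rule 1: losing half (or more) of the nodes stops the cluster.
    if 2 * len(down) >= total_nodes:
        return True
    # Rule 2: losing a node together with its buddy stops the cluster.
    return any((node + 1) % total_nodes in down for node in down)


print(cluster_is_down(8, {0}))           # False: one node down
print(cluster_is_down(8, {0, 2, 4}))     # False: three non-adjacent nodes
print(cluster_is_down(8, {3, 4}))        # True: two buddy nodes
print(cluster_is_down(8, {0, 2, 4, 6}))  # True: four of eight nodes
```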

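Troubleshooting

Spread Logging

Spread logging is disabled by default.
Enabling it can delay token transmission, so a node may go 8 seconds without receiving the token.
After 8 seconds, the Spread daemons remove the unresponsive node from the cluster.
If Spread logging is enabled, we recommend disabling it to avoid these delays.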

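Network Bandwidth Exhaustion

Use the following steps to check and reduce network usage.

1. Query the DC_NETWORK_INFO system table to check the network usage of the system.
You can also specify a start time and an end time to check network usage for a specific period.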

=> SELECT node_name
     ,start_time StartTime
     ,end_time EndTime
     ,tx_kbytes_per_sec
     ,rx_kbytes_per_sec
     ,tx_kbytes_per_sec + rx_kbytes_per_sec total_kbytes_per_sec
FROM (
     SELECT node_name
           ,round(min(start_time), 'SS') AS start_time
           ,round(max(end_time), 'SS') AS end_time
           ,round(((sum(tx_bytes_end_value - tx_bytes_start_value) / 1024) / (datediff('millisecond', min(start_time), max(end_time)) / 1000))::FLOAT, 2) AS tx_kbytes_per_sec
           ,round(((sum(rx_bytes_end_value - rx_bytes_start_value) / 1024) / (datediff('millisecond', min(start_time), max(end_time)) / 1000))::FLOAT, 2) AS rx_kbytes_per_sec
     FROM dc_network_info_by_second
     WHERE start_time > '2016-09-13 16:00:00-04'
           AND end_time < '2016-09-13 17:00:00-04'
           AND interface_id LIKE 'eth0'
     GROUP BY node_name
           ,round(start_time, 'SS')
     ) a
ORDER BY 2,node_name;

     node_name      |      StartTime      |       EndTime       | tx_kbytes_per_sec | rx_kbytes_per_sec | total_kbytes_per_sec
--------------------+---------------------+---------------------+-------------------+-------------------+----------------------
 v_vmartdb_node0001 | 2016-09-14 05:00:03 | 2016-09-14 05:00:04 |                 0 |              6.68 |                 6.68
 v_vmartdb_node0002 | 2016-09-14 05:00:03 | 2016-09-14 05:00:04 |                 0 |              6.66 |                 6.66
 v_vmartdb_node0003 | 2016-09-14 05:00:03 | 2016-09-14 05:00:04 |                 0 |              6.67 |                 6.67
 v_vmartdb_node0001 | 2016-09-14 05:00:04 | 2016-09-14 05:00:05 |              0.38 |              0.19 |                 0.57

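2. Use vnetperf to measure the network performance of your hosts.
Compare the MB sent and received, as reported by vnetperf, with the available bandwidth.
In the following example, the available bandwidth is about 128 MB/s (1 Gbps).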

$ /opt/vertica/bin/vnetperf

(example)
Date                    | Test           | Rate Limit (MB/s) | Node    | MB/s (sent) | MB/s (rec) |
----------------------------------------------------------------------------------------------------
2016-09-14_11:33:37,973 | tcp-throughput | 32                | average | 31.1396     | 31.1396    |
2016-09-14_11:33:40,21  | tcp-throughput | 64                | average | 61.6482     | 61.6467    |
2016-09-14_11:33:42,65  | tcp-throughput | 128               | average | 122.681     | 122.683    |
2016-09-14_11:33:44,125 | tcp-throughput | 256               | average | 144.845     | 148.393    |
2016-09-14_11:33:46,664 | tcp-throughput | 512               | average | 154.858     | 160.587    |
2016-09-14_11:33:49,291 | tcp-throughput | 640               | average | 147.637     | 151.906    |
2016-09-14_11:33:51,928 | tcp-throughput | 768               | average | 152.137     | 156.577    |
2016-09-14_11:33:54,455 | tcp-throughput | 1024              | average | 150.206     | 153.423    |
2016-09-14_11:33:56,965 | tcp-throughput | 2048              | average | 155.692     | 157.993    |
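
To relate measured throughput to link capacity, a small Python helper can convert a link speed to MB/s and express the throughput as a share of it. The function names are hypothetical. The conversion assumes decimal units (1 Gbps = 1000 Mbps, 8 bits per byte), which gives 125 MB/s for a 1 Gbps link; the figure of about 128 MB/s quoted for 1 Gbps corresponds to dividing 1024 by 8. The sample throughput values are made up.

```python
def link_capacity_mb_per_sec(gbps):
    """Convert a link speed in Gbps to MB/s using decimal units."""
    return gbps * 1000 / 8


def utilization_percent(tx_kbytes_per_sec, rx_kbytes_per_sec, gbps):
    """Share of link capacity used by the busier direction, in percent.

    Assumes a full-duplex link, so send and receive are each compared
    with the capacity rather than summed. Treats 1 MB as 1000 KB, which
    is an approximation if the input was computed with 1024-byte KB.
    """
    capacity_kb = link_capacity_mb_per_sec(gbps) * 1000
    busiest = max(tx_kbytes_per_sec, rx_kbytes_per_sec)
    return round(busiest / capacity_kb * 100, 2)


print(link_capacity_mb_per_sec(1))            # 125.0
# Made-up sample: 100,000 KB/s sent and 50,000 KB/s received on 1 Gbps.
print(utilization_percent(100000, 50000, 1))  # 80.0
```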

3. Use Database Designer to tune your queries.
Unoptimized projections can increase network usage because of RESEGMENT/BROADCAST operators.

4. If network usage remains high, consider upgrading the network.

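Network Retransmission Issues

If the hosts running your Vertica database have a large amount of RAM,
a large share of memory is used for write caching by default, which can lead to a full flush.
A full flush can block the Spread daemon for a long time (around 8 seconds) and trigger the token timeout.

When the disks cannot keep up with incoming write requests while applications keep writing data,
the write cache grows past its threshold.
The kernel then blocks all I/O requests until the cache drops back below the defined threshold.
Under heavy load, this occasionally causes a node to be removed from the cluster.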

vm.dirty_background_ratio = 10

vm.dirty_ratio = 20

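The first parameter, vm.dirty_background_ratio, defines the threshold for write-cache usage.
When the threshold is reached, the kernel starts flushing to disk in the background.

The second parameter, vm.dirty_ratio, defines the memory threshold.
When this threshold is reached, the kernel blocks other I/O requests until the flush finishes.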
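
To see why hosts with a lot of memory are affected, the two percentages can be converted to absolute sizes. The Python sketch below is an approximation under a stated assumption: it applies the percentages to total RAM, whereas the kernel applies them to the memory it considers available, so real thresholds are somewhat lower. The function name and the 256 GB host are hypothetical.

```python
def dirty_thresholds_gb(total_ram_gb, background_ratio=10, dirty_ratio=20):
    """Approximate write-cache thresholds in GB.

    Assumption: percentages are applied to total RAM (an upper bound).
    Returns (size at which background flushing starts,
             size at which writers are blocked).
    """
    background_gb = total_ram_gb * background_ratio / 100
    blocking_gb = total_ram_gb * dirty_ratio / 100
    return background_gb, blocking_gb


# Hypothetical host with 256 GB of RAM and the ratios 10 and 20.
background, blocking = dirty_thresholds_gb(256)
print(background, blocking)  # 25.6 51.2
```

With thresholds this large, a blocking flush may have many gigabytes to write, which on slower disks can take seconds.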

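Network UDP Receive Errors

If you continue to see a large number of UDP packet receive errors, tune the kernel parameters as follows.

1. Use the following command to check the number of UDP packet receive errors.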

$ netstat -su

Udp:
    490180944 packets received
    359 packets to unknown port received.
    43303030 packet receive errors
    37289 packets sent
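
As a rough way to judge whether the error count is high, the counters can be turned into a percentage. The Python sketch below parses text in the format shown in this output; the function name is hypothetical, and other netstat versions may word these lines differently, in which case the patterns would need adjusting.

```python
import re


def udp_receive_error_percent(netstat_output):
    """Return UDP receive errors as a percentage of packets received."""
    received = re.search(r"(\d+) packets received", netstat_output)
    errors = re.search(r"(\d+) packet receive errors", netstat_output)
    if not received or not errors:
        raise ValueError("UDP counters not found in netstat output")
    total = int(received.group(1))
    if total == 0:
        return 0.0
    return round(int(errors.group(1)) / total * 100, 2)


# The sample output from this section.
sample = """Udp:
    490180944 packets received
    359 packets to unknown port received.
    43303030 packet receive errors
    37289 packets sent
"""
print(udp_receive_error_percent(sample))  # 8.83
```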

2. If the number of UDP packet receive errors is high, add the following parameters to the /etc/sysctl.conf file.

The following parameters are related to UDP errors.

net.core.rmem_max
net.core.rmem_default
net.core.netdev_max_backlog
net.ipv4.udp_mem
net.ipv4.udp_rmem_min
net.ipv4.udp_wmem_min

# Increase number of incoming connections
net.core.somaxconn = 1024
# Sets the send socket buffer maximum size in bytes.
net.core.wmem_max = 16777216
# Sets the receive socket buffer maximum size in bytes.
net.core.rmem_max = 16777216
# Sets the send socket buffer default size in bytes.
net.core.wmem_default = 262144
# Sets the receive socket buffer default size in bytes.
net.core.rmem_default = 262144
# Sets the maximum number of packets allowed to queue when a particular interface receives packets faster than the kernel can process them.
# increase the length of the processor input queue
net.core.netdev_max_backlog = 2000
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 8192 262144 8388608
net.ipv4.tcp_rmem = 8192 262144 8388608
net.ipv4.udp_mem = 16777216 16777216 16777216
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384

If UDP errors still occur with net.core.rmem_max=16777216, you can double the value with the following command.

sudo sysctl -w net.core.rmem_max=33554432

If workload concurrency is high or Vertica is CPU-intensive,
you can increase the size and depth of the packet input queue with the following command.

sudo sysctl -w net.core.netdev_max_backlog=2000

After adding the parameters to the /etc/sysctl.conf file, run the following command.

$ sudo sysctl -p

Memory Exhaustion

Use the following steps to check and reduce memory usage.

1. Check the memory usage of the system.

=> SELECT node_name, round(start_time, 'SS') as start_time,
   round(end_time, 'SS') as end_time,
   round(100 -
           ( free_memory_sample_sum       / free_memory_sample_count +
             buffer_memory_sample_sum     / buffer_memory_sample_count +
             file_cache_memory_sample_sum / file_cache_memory_sample_count ) /
           ( total_memory_sample_sum      / total_memory_sample_count ) * 100.0, 2.0)
   as average_memory_usage_percent
FROM dc_memory_info_by_second
WHERE start_time between '2016-09-13 15:00:00-04' and '2016-09-13 16:00:00-04'
order by start_time, node_name;

     node_name      |     start_time      |      end_time       | average_memory_usage_percent
--------------------+---------------------+---------------------+------------------------------
 v_vmartdb_node0001 | 2016-09-14 04:00:00 | 2016-09-14 04:00:01 |                        79.52
 v_vmartdb_node0002 | 2016-09-14 04:00:00 | 2016-09-14 04:00:01 |                        71.29
 v_vmartdb_node0003 | 2016-09-14 04:00:00 | 2016-09-14 04:00:01 |                        71.47
 v_vmartdb_node0001 | 2016-09-14 04:00:01 | 2016-09-14 04:00:02 |                        79.52
 v_vmartdb_node0002 | 2016-09-14 04:00:01 | 2016-09-14 04:00:02 |                        71.29
 v_vmartdb_node0003 | 2016-09-14 04:00:01 | 2016-09-14 04:00:02 |                        71.47

2. Check the catalog size.

=> SELECT node_name
       ,max(ts) AS ts
       ,max(catalog_size_in_MB) AS catalog_size_in_MB
FROM (
       SELECT node_name
             ,trunc((dc_allocation_pool_statistics_by_second."time")::TIMESTAMP, 'SS'::VARCHAR(2)) AS ts
             ,sum((dc_allocation_pool_statistics_by_second.total_memory_max_value - dc_allocation_pool_statistics_by_second.free_memory_min_value)) / (1024 * 1024) AS catalog_size_in_MB
       FROM dc_allocation_pool_statistics_by_second
       GROUP BY 1,trunc((dc_allocation_pool_statistics_by_second."time")::TIMESTAMP, 'SS'::VARCHAR(2))
       ) foo
GROUP BY 1
ORDER BY 1;

     node_name      |         ts          |  catalog_size_in_MB
--------------------+---------------------+-----------------------
 v_vmartdb_node0001 | 2016-09-29 19:50:16 | 5343.6447143554687500
 v_vmartdb_node0002 | 2016-09-29 19:50:16 | 4889.1784667968750000
 v_vmartdb_node0003 | 2016-09-29 19:50:16 | 4861.3525390625000000

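3. If the catalog size is greater than 4% of total memory,
reduce the memory usage of the Vertica GENERAL pool and restart the cluster.

For example, if the catalog size is 5 GB and physical memory is 100 GB, a MAXMEMORYSIZE of 95% corresponds to 95 GB.
Because catalog information is loaded outside that Vertica memory space, the pool and the catalog together can claim all 100 GB. You should therefore reduce the MAXMEMORYSIZE of the GENERAL pool.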
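
The arithmetic in this example can be written as a short Python sketch. The 4% threshold comes from the text; the sizing rule (shrink the pool by the catalog's share of memory, rounded up to a whole percent) is only an assumption chosen for illustration, not a documented Vertica formula, and the function names are hypothetical.

```python
import math

CATALOG_THRESHOLD_PERCENT = 4.0  # threshold mentioned in the text


def catalog_percent(catalog_gb, physical_gb):
    """Catalog size as a percentage of physical memory."""
    return catalog_gb * 100 / physical_gb


def suggested_general_pool_percent(catalog_gb, physical_gb, current_percent=95):
    """Return a MAXMEMORYSIZE percentage for the GENERAL pool.

    Assumption (not a Vertica rule): when the catalog exceeds the
    threshold, shrink the pool by the catalog's share of memory,
    rounded up to a whole percent. Otherwise keep the current value.
    """
    share = catalog_percent(catalog_gb, physical_gb)
    if share <= CATALOG_THRESHOLD_PERCENT:
        return current_percent
    return current_percent - math.ceil(share)


print(catalog_percent(5, 100))                 # 5.0
print(suggested_general_pool_percent(5, 100))  # 90
print(suggested_general_pool_percent(3, 100))  # 95
```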

Use the following commands to reduce the MAXMEMORYSIZE of the GENERAL pool.

=> SELECT name, maxmemorysize FROM resource_pools WHERE name='general';

name | maxmemorysize
---------+---------------
general | Special: 95%
(1 row)

=> ALTER RESOURCE POOL general maxmemorysize '90%';
NOTICE 2585: Change takes effect upon restart. Recovering nodes will use the new value
ALTER RESOURCE POOL

=> SELECT name, maxmemorysize FROM resource_pools WHERE name='general';

name | maxmemorysize
---------+---------------
general | Special: 90%
(1 row)

Alternatively, you can create a new resource pool dedicated to the catalog.

=> CREATE RESOURCE POOL catalog_pool memorysize '4G';

CREATE RESOURCE POOL

CPU-Intensive Workloads

Use the following steps to check and reduce CPU usage.

1. Check the CPU usage of the system.
You can also specify a start time and an end time to check CPU usage for a specific period.

=> SELECT node_name, round(start_time, 'SS') as start_time, round(end_time, 'SS') as end_time, round(100 -
((idle_microseconds_end_value - idle_microseconds_start_value) /
(user_microseconds_end_value + nice_microseconds_end_value + system_microseconds_end_value
+ idle_microseconds_end_value + io_wait_microseconds_end_value + irq_microseconds_end_value
+ soft_irq_microseconds_end_value + steal_microseconds_end_value + guest_microseconds_end_value
- user_microseconds_start_value - nice_microseconds_start_value - system_microseconds_start_value
- idle_microseconds_start_value - io_wait_microseconds_start_value - irq_microseconds_start_value
- soft_irq_microseconds_start_value - steal_microseconds_start_value - guest_microseconds_start_value)
) * 100, 2.0) average_cpu_usage_percent
FROM dc_cpu_aggregate_by_second
where start_time between '2016-09-13 15:00:00-04' and '2016-09-13 16:00:00-04'
order by start_time, node_name;

     node_name      |     start_time      |      end_time       | average_cpu_usage_percent
--------------------+---------------------+---------------------+---------------------------
 v_vmartdb_node0001 | 2016-09-14 04:00:00 | 2016-09-14 04:00:01 |                     98.53
 v_vmartdb_node0002 | 2016-09-14 04:00:00 | 2016-09-14 04:00:01 |                     95.67
 v_vmartdb_node0003 | 2016-09-14 04:00:00 | 2016-09-14 04:00:01 |                     96.97
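
The calculation in the CPU usage query can be expressed as a short Python function: usage is 100 minus the idle share of all CPU time that elapsed between two counter snapshots. The function name and the sample counters are hypothetical; the state names mirror the *_start_value and *_end_value columns used in the query.

```python
CPU_STATES = ("user", "nice", "system", "idle", "io_wait",
              "irq", "soft_irq", "steal", "guest")


def cpu_usage_percent(start, end):
    """Busy CPU percentage between two counter snapshots.

    `start` and `end` map each state in CPU_STATES to cumulative
    microseconds, like the start and end columns in the query.
    """
    total = sum(end[s] - start[s] for s in CPU_STATES)
    if total == 0:
        return 0.0
    idle = end["idle"] - start["idle"]
    return round(100 - idle / total * 100, 2)


# Made-up counters: 1,000,000 microseconds elapsed, 50,000 of them idle.
start = {state: 0 for state in CPU_STATES}
end = dict(start, user=800_000, system=150_000, idle=50_000)
print(cpu_usage_percent(start, end))  # 95.0
```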

2. If CPU usage is too high, dedicate a single core to Spread.

a. Find the process IDs of the Spread daemon and the Vertica process.

$ ps -ef|grep catalog

dbadmin 461884      1 0 Sep13 ?        00:00:07 /opt/vertica/spread/sbin/spread -c /home/dbadmin/VMartDB/v_vmartdb_node0001_catalog/spread.conf -D /opt/vertica/spread/tmp
dbadmin 461886      1 1 Sep13 ?        00:03:52 /opt/vertica/bin/vertica -D /home/dbadmin/VMartDB/v_vmartdb_node0001_catalog -C VMartDB -n v_vmartdb_node0001 -h 192.168.30.71 -p 5433 -P 4803 -Y ipv4
dbadmin 461902 461886 0 Sep13 ?        00:00:11 /opt/vertica/bin/vertica-udx-zygote 15 14 461886 debug-log-off /home/dbadmin/VMartDB/v_vmartdb_node0001_catalog/UDxLogs 60 16 0
dbadmin 509818 504522 0 05:56 pts/0    00:00:00 grep catalog

In the output above, 461884 is the Spread daemon process ID and 461886 is the Vertica process ID.

b. Use taskset to check the CPU affinity of the Spread daemon.

$ taskset -cp 461884

pid 461884's current affinity list: 0,1,2,3

c. Use taskset to check the CPU affinity of the Vertica process ID.

$ taskset -cp 461886

pid 461886's current affinity list: 0,1,2,3

d. Assign core 0 to Spread and cores 1, 2, and 3 to Vertica.

$ taskset -cp 0 461884

pid 461884's current affinity list: 0-3
pid 461884's new affinity list: 0

$ taskset -cp 1,2,3 461886
pid 461886's current affinity list: 0-3
pid 461886's new affinity list: 1-3
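
The same split can be sketched in Python with the standard library calls os.sched_getaffinity and os.sched_setaffinity, which are available on Linux only. The helper names split_cores and pin_processes are hypothetical, and choosing the lowest-numbered core for Spread is just the convention used in steps a to d. Like taskset -p without the -a option, os.sched_setaffinity changes only the task whose ID you pass, so treat this as an illustration rather than a complete pinning tool.

```python
import os


def split_cores(cores):
    """Reserve the lowest-numbered core for Spread, the rest for Vertica."""
    ordered = sorted(cores)
    if len(ordered) < 2:
        raise ValueError("need at least two cores to separate the processes")
    return {ordered[0]}, set(ordered[1:])


def pin_processes(spread_pid, vertica_pid):
    """Apply the split with os.sched_setaffinity (Linux only).

    Like taskset, this needs permission to change the target processes.
    """
    spread_cores, vertica_cores = split_cores(os.sched_getaffinity(spread_pid))
    os.sched_setaffinity(spread_pid, spread_cores)
    os.sched_setaffinity(vertica_pid, vertica_cores)
    return spread_cores, vertica_cores


# Only the pure helper is exercised here; pin_processes(461884, 461886)
# would correspond to the two taskset commands in step d.
print(split_cores({0, 1, 2, 3}))  # ({0}, {1, 2, 3})
```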
