Running K-means in Vertica


버리까 2016. 12. 27. 18:56

The K-means algorithm groups the given data into k clusters,

and works by minimizing the variance of the distances between each cluster's center and the points assigned to it.

It is a form of unsupervised learning: it assigns labels to input data that
has no labels.

(See Wikipedia: https://ko.wikipedia.org/wiki/K-%ED%8F%89%EA%B7%A0_%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98 )
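
Written out, the objective described above is usually stated as follows. This is the standard textbook formulation, not something taken from the Vertica documentation, and the symbols S_i and \mu_i are notation introduced here.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```

Here S_i is the set of points assigned to cluster i and \mu_i is that cluster's center (the mean of its points).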

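Practice dataset

iris

The iris dataset holds sepal and petal measurements (length and width) for iris flowers.

If you are hazy on what sepals and petals are,
take a quick look here or search the web -> http://withbook.tistory.com/426

iris table structure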

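CREATE TABLE public.iris
(
    id int,             -- row number
    Sepal_Length float, -- sepal length
    Sepal_Width float,  -- sepal width
    Petal_Length float, -- petal length
    Petal_Wdith float,  -- petal width (the column name is spelled "Wdith" in this table)
    Species varchar(10) -- iris species (3 kinds)
);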

iris projection structure

CREATE PROJECTION public.iris
(
    id,
    Sepal_Length,
    Sepal_Width,
    Petal_Length,
    Petal_Wdith,
    Species
)
AS
SELECT iris.id,
       iris.Sepal_Length,
       iris.Sepal_Width,
       iris.Petal_Length,
       iris.Petal_Wdith,
       iris.Species
FROM public.iris
ORDER BY iris.id,
         iris.Sepal_Length,
         iris.Sepal_Width,
         iris.Petal_Length,
         iris.Petal_Wdith,
         iris.Species
SEGMENTED BY hash(iris.id, iris.Sepal_Length, iris.Sepal_Width, iris.Petal_Length, iris.Petal_Wdith, iris.Species) ALL NODES KSAFE 1;

The data looks like this. (The sample data is available at https://github.com/vertica/Machine-Learning-Examples )

id	Sepal_Length	Sepal_Width	Petal_Length	Petal_Wdith	Species
4	4.6	3.1	1.5	0.2	setosa
(rows omitted)...
68	5.8	2.7	4.1	1	versicolor
69	6.2	2.2	4.5	1.5	versicolor
(rows omitted)...
114	5.7	2.5	5	2	virginica
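
Before clustering, a quick aggregate over the table defined above can confirm that the data loaded and show how the measurements differ by species. This is a plain SQL sketch; the output column aliases are names introduced here.

```sql
-- Row count and average measurements per species in public.iris.
SELECT Species,
       COUNT(*)          AS row_count,
       AVG(Sepal_Length) AS avg_sepal_length,
       AVG(Sepal_Width)  AS avg_sepal_width,
       AVG(Petal_Length) AS avg_petal_length,
       AVG(Petal_Wdith)  AS avg_petal_width  -- the column is spelled "Wdith" in the table
FROM public.iris
GROUP BY Species
ORDER BY Species;
```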

Vertica Kmeans syntax

kmeans ( 'model_name', 'input_table', 'input_columns', num_clusters
	[, '--exclude_columns=col1, col2, ... coln]
	[ --max_iterations=value]
	[ --epsilon=value]
	[ --init_method=method]
	[ --initial_centers_table=table_name]
	[ --distance_method=method]
	[ --output_view=output_view]
	[ --key_columns=key_columns]
	[ --description="model_description"]' )

Arguments

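model_name : Name of the model. Model names are case-insensitive.

input_table : Table that holds the input data.

input_columns : Columns used for clustering. The wildcard (*) is also supported.

num_clusters : Number of clusters (chosen freely by the user).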

Parameter	Data Type	Description
exclude_columns= col1, col2, ... coln	VARCHAR	Columns of the input table to exclude from clustering.

max_iterations=value	INTEGER	Maximum number of iterations the algorithm performs. If this is set lower than the number of iterations needed for convergence, the algorithm may not converge.
		Default: 10
epsilon=value	FLOAT	Determines whether the algorithm has converged. The algorithm has converged when, after an iteration, no component of any cluster center has changed by epsilon or more.
		Default: 1e-4 (scientific notation for 0.0001). In mathematics, epsilon usually denotes an arbitrarily small number greater than 0. (https://namu.wiki/w/엡실론)
init_method=method	VARCHAR	Method used to find the initial cluster centers. It cannot be used when initial_centers_table has a value; if you supply values for both init_method and initial_centers_table, Vertica returns an error.
		Default: random
initial_centers_table= table_name	VARCHAR	Table that contains the initial cluster centers to use. Supply it when you already know the initial centers you want and do not want Vertica to find them for you.
		It cannot be used when init_method has a value; if you supply values for both init_method and initial_centers_table, Vertica returns an error.
distance_method= method	VARCHAR	Distance function used to measure the distance between points during clustering.
		Default: euclidean, that is, Euclidean distance: the straight-line distance between two points, computed as the square root of the sum of the squared differences of their coordinates.
output_view= output_view	VARCHAR	Name of the view that stores the cluster assigned to each row (chosen freely by the user).

key_columns= key_columns	VARCHAR	Identifier key column(s). If you supply a value for key_columns without supplying a value for output_view, Vertica returns an error.

description= description	VARCHAR (2048)	Model description. Stored in the v_ml.models table.
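
Two of the parameters above have a compact mathematical reading. The first expression is the Euclidean distance between two points x and y with p coordinates; the second is one way to write the epsilon test as described in the table (no component of any center moves by epsilon or more between iterations t and t+1). The notation is introduced here for illustration, and the second expression is a reading of the description, not a statement of Vertica's exact implementation.

```latex
d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
\qquad\text{and}\qquad
\max_{i,\,j} \left| \mu_{ij}^{(t+1)} - \mu_{ij}^{(t)} \right| < \varepsilon
```

Here \mu_{ij}^{(t)} is the j-th component of the center of cluster i after iteration t.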

Example

SELECT v_ml.kmeans('ytkimKmeansModel5', 'iris', '*', 3,
'--max_iterations=1 --output_view=ytkimKmeansView5 --key_columns=id
--exclude_columns=Species --description="ytkim_iris_kmeans_model5"');

After it runs, query ytkimKmeansView5

SELECT * FROM ytkimKmeansView5;

id	cluster_id
2	0
3	0
47	0
49	0
50	0
61	0
71	1
97	1
115	1
119	2
121	2
122	2
126	2

... (remaining rows omitted)

No surprise here...

The rows have been assigned to three cluster IDs (0, 1, 2).
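
Since the iris table also carries the true Species, one way to inspect the assignment is to cross-tabulate cluster_id against Species by joining the output view back to the table on id. This is a sketch in plain SQL; the alias names are introduced here, and no particular output is implied.

```sql
-- How many rows of each species landed in each cluster.
SELECT v.cluster_id,
       i.Species,
       COUNT(*) AS row_count
FROM ytkimKmeansView5 v
JOIN public.iris i
  ON i.id = v.id
GROUP BY v.cluster_id, i.Species
ORDER BY v.cluster_id, i.Species;
```

A clustering that separates the species well would show each cluster dominated by a single species.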

To check a model's summary information,

use the SUMMARIZE_MODEL function.

- Calling this function commits every transaction in progress in your session, so use it with care.

The syntax is as follows.

SUMMARIZE_MODEL ( 'model_name' [ , 'owner' ] )

model_name	Name of the model to summarize
owner	User who created the model

To see the summary of the model we just created:

SELECT v_ml.SUMMARIZE_MODEL ('ytkimKmeansModel5','dbadmin');

You get a result like the following.

k-Means Model Summary:
Number of clusters: 3 -- number of clusters
Input columns: id, sepal_length, sepal_width, petal_length, petal_wdith -- input columns (variables)
Cluster centers: -- center of each cluster
0: {id: 35.5000000, sepal_length: 5.2828571, sepal_width: 3.2371429, petal_length: 2.2600000, petal_wdith: 0.5542857}
1: {id: 99.0000000, sepal_length: 6.2298246, sepal_width: 2.8508772, petal_length: 4.9070175, petal_wdith: 1.6578947}
2: {id: 139.0000000, sepal_length: 6.5913043, sepal_width: 3.0217391, petal_length: 5.4695652, petal_wdith: 2.0260870}
Evaluation metrics: -- evaluation metrics
Total Sum of Squares: 281918.87
Within-Cluster Sum of Squares:
Cluster 0: 25370.374
Cluster 1: 12956.619
Cluster 2: 3401.8942
Total Within-Cluster Sum of Squares: 41728.887
Between-Cluster Sum of Squares: 240189.98
Between-Cluster SS / Total SS: 85.2%
Number of iterations performed: 1 -- number of iterations
Converged: False
Call:
kmeans(model_name=ytkimKmeansModel5, input_table=iris, input_columns=*, num_clusters=3,
exclude_columns=species, max_iterations=1, epsilon=0.0001, init_method=random, initial_centers_table=,
distance_method=euclidean, outputView=ytkimkmeansview5, key_columns=id
)
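
The evaluation metrics in this summary are consistent with the usual sum-of-squares decomposition (total = within-cluster + between-cluster). The check below uses only the numbers printed above; the abbreviations TSS, WSS and BSS are shorthand introduced here.

```latex
\begin{aligned}
\mathrm{WSS} &= 25370.374 + 12956.619 + 3401.8942 \approx 41728.887 \\
\mathrm{BSS} &= \mathrm{TSS} - \mathrm{WSS} = 281918.87 - 41728.887 \approx 240189.98 \\
\frac{\mathrm{BSS}}{\mathrm{TSS}} &= \frac{240189.98}{281918.87} \approx 0.852
\end{aligned}
```

A higher BSS/TSS ratio means the clusters are tighter relative to the overall spread of the data.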

For comparison, change the iteration count and run the test again.

SELECT v_ml.kmeans('ytkimKmeansModel7', 'iris', '*', 3,
'--max_iterations=30 --output_view=ytkimKmeansView7 --key_columns=id
--exclude_columns=Species --description="ytkim_iris_kmeans_model7"');

Finished in 12 iterations -- we allowed up to 30 iterations, but it actually stopped after 12 (max_iterations is only an upper bound). Well, well.

SELECT v_ml.SUMMARIZE_MODEL ('ytkimKmeansModel7','dbadmin');

k-Means Model Summary:
Number of clusters: 3
Input columns: id, sepal_length, sepal_width, petal_length, petal_wdith
Cluster centers:
0: {id: 26.0000000, sepal_length: 5.0450980, sepal_width: 3.4235294, petal_length: 1.5254902, petal_wdith: 0.2686275}
1: {id: 76.5000000, sepal_length: 5.9220000, sepal_width: 2.7720000, petal_length: 4.2860000, petal_wdith: 1.3480000}
2: {id: 126.0000000, sepal_length: 6.5938776, sepal_width: 2.9673469, petal_length: 5.5428571, petal_wdith: 2.0163265}
Evaluation metrics:
Total Sum of Squares: 281918.87
Within-Cluster Sum of Squares:
Cluster 0: 11080.685
Cluster 1: 10446.352
Cluster 2: 9842.9029
Total Within-Cluster Sum of Squares: 31369.939
Between-Cluster Sum of Squares: 250548.93
Between-Cluster SS / Total SS: 88.87%
Number of iterations performed: 12
Converged: True
Call:
kmeans(model_name=ytkimKmeansModel7, input_table=iris, input_columns=*, num_clusters=3,
exclude_columns=species, max_iterations=30, epsilon=0.0001, init_method=random, initial_centers_table=,
distance_method=euclidean, outputView=ytkimkmeansview7, key_columns=id
)

With a single iteration, Converged was False, but after 12 iterations
it changed to True (the algorithm converged).
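
To see how the row assignments themselves changed between the two runs, the two output views can be joined on id. This is a plain SQL sketch with alias names introduced here. Note that cluster IDs are arbitrary labels, so cluster 0 in one run need not correspond to cluster 0 in the other.

```sql
-- Cross-tabulate the 1-iteration assignment against the converged one.
SELECT a.cluster_id AS cluster_in_model5,
       b.cluster_id AS cluster_in_model7,
       COUNT(*)     AS row_count
FROM ytkimKmeansView5 a
JOIN ytkimKmeansView7 b
  ON a.id = b.id
GROUP BY a.cluster_id, b.cluster_id
ORDER BY a.cluster_id, b.cluster_id;
```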

The rest of the summary information differs between the two runs as well.
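
Both summaries list id among the input columns, and the id component of each center (for example 26.0, 76.5 and 126.0 in the second run) is far larger in scale than the flower measurements, so id likely dominates the distances. A possible follow-up, sketched below, is to exclude id from clustering as well. This assumes that exclude_columns accepts a comma-separated list, as the syntax section suggests, and that a column may appear in both key_columns and exclude_columns. The model and view names (ytkimKmeansModel8, ytkimKmeansView8) are hypothetical, and the statement has not been run here.

```sql
-- Hypothetical rerun: keep id as the row key for the output view,
-- but exclude it (together with Species) from the distance computation.
SELECT v_ml.kmeans('ytkimKmeansModel8', 'iris', '*', 3,
'--max_iterations=30 --output_view=ytkimKmeansView8 --key_columns=id
--exclude_columns=id,Species --description="ytkim_iris_kmeans_model8"');
```

If it runs as assumed, the Input columns line of its summary should then list only the four measurement columns.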