word2vec

1.1 $\mathbf{A}\mathbf{x}$ 연산법

행렬과 벡터의 곱에 대하여 다룬다고 해서 해당 내용이 기초적인 선형대수에 대한 것은 아니다. 이 글의 모든 내용은 이미 기초적인 선형대수를 한번은 다 다루었다고 생각하고 데이터 과학과 기계학습의 측면에서 필요한 선행 대수를 다루고자 한다.

1.1.1 행렬과 벡터의 곱 연산법

다음 $\mathbf{A}\mathbf{x}$ 꼴의 연산을 살펴보자.

그러면 일반적으로는 $\mathbf{Ax} = \begin{pmatrix}1 & 2 & 3\\7 & 3 & 10\\ 2 & 3 & 5\end{pmatrix} \begin{pmatrix}x_1 \\ x_2 \\ x_3\\\end{pmatrix} = \begin{pmatrix}1 \cdot x_1 + 2 \cdot x_2 + 3 \cdot x_3 \\ 7 \cdot x_1 + 3 \cdot x_2 + 10 \cdot x_3 \\ 2 \cdot x_1 + 3 \cdot x_2 + 5 \cdot x_3 \end{pmatrix}$

이다. 무엇이 보이는가? 수학적 안목이 뛰어나지 않다면 이는 그저 $3 \times 1$ 크기의 벡터에 불과하다. 이런 곱셈 방식을 정리하면 다음과 같다.

"$\mathbf{A}$의 row vector와 $\mathbf{x}$ column vector의 내적"

하지만 이는 초등적인 방식이고 수학적으로 의미하는 바를 찾기 힘들다. 따라서 우리는

"$\mathbf{A}$의 column vector와 $\mathbf{x}$ 요소의 스칼라 곱"

의 차원에서 살펴볼 것이다. 위 예를 해당 방식으로 표현하면

\[\mathbf{Ax} = \begin{pmatrix}\begin{pmatrix}1 \\ 7 \\ 2 \end{pmatrix}\begin{pmatrix}2 \\ 3 \\ 3\\\end{pmatrix}\begin{pmatrix}3 \\ 10 \\ 5\\\end{pmatrix}\end{pmatrix} \begin{pmatrix}x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix}1 \\ 7 \\ 2 \end{pmatrix}x_1+\begin{pmatrix}2 \\ 3 \\ 3\\\end{pmatrix}x_2 + \begin{pmatrix}3 \\ 10 \\ 5 \end{pmatrix}x_3\]

이다. 이는 언듯 보면 연산이 복잡해보일 수 있다. 하지만 행렬의 열벡터를 직접 활용한다는 점에서 column space 개념을 활용할 수 있게 해준다.

💡 $\mathbf{A}\mathbf{x}$의 의미

$\mathbf{A}\mathbf{x}$는 $\mathbf{A}$의 열벡터의 선형 결합을 의미한다.

그럼 이제 $\mathbf{A}$의 열벡터 공간 (column space)를 구해보자. 물론 이를 위해서는 $\mathbf{A}$를 RREF꼴로 바꾼후 leading 1에 해당하는 열벡터만이 열 벡터 공간의 basis가 될 수 있다고 하지만, 예에서 사용된 $\mathbf{A}$는 누가 보아도

\[\begin{pmatrix}1 \\ 7 \\ 2\\\end{pmatrix} + \begin{pmatrix}2 \\ 3 \\ 3\\\end{pmatrix} = \begin{pmatrix}3 \\ 10 \\ 5\\\end{pmatrix}\]

이다. 따라서 우리는

\[rk(C(\mathbf{A})) = 2\]

임을 알 수 있다.

1.1.2 열공간의 랭크와 행공간의 랭크

만약 $\mathbf{A}$를 $\mathbf{C} = C(\mathbf{A})$라는 행렬로 분해하고자 한다면 다음과 같이 쓸 수 있다.

\[\mathbf{A} = \begin{pmatrix}1 & 2 & 3\\7 & 3 & 10\\ 2 & 3 & 5\end{pmatrix} = \begin{pmatrix}1 & 2\\7 & 3\\2&3\end{pmatrix} \mathbf{R}\]

이제 $\mathbf{R}$에 해당하는 값을 찾아야한다. 하지만 앞에서 살펴보았듯, 행렬 $\mathbf{A}$의 첫 두 행을 더한 것이 세번째 행이므로

\[\mathbf{A} = \begin{pmatrix}1 & 2 & 3\\7 & 3 & 10\\ 2 & 3 & 5\end{pmatrix} = \begin{pmatrix}1 & 2\\7 & 3\\2&3\end{pmatrix}\begin{pmatrix}1&0&1\\0&1&1\end{pmatrix}\]

이며

$\mathbf{R} = \begin{pmatrix}1&0&1\\0&1&1\end{pmatrix}$ 임을 알 수 있다.

또 한가지 놀라운 사실은 $\mathbf{R}$ 은 $\mathbf{A}$의 row space를 나타낸다는 것이다! 따라서 우리는 1가지 흥미로운 사실을 얻을 수 있다.

Theorem 1

임의의 행렬 $\mathbf{A}$에 대하여 다음이 성립한다.
$dim(C(\mathbf{A})) = dim(R(\mathbf{A})) = rk(\mathbf{A})$

이는 즉, 임의의 행렬의 열벡터 공간의 랭크와 행벡터 공간의 랭크는 항상 같음을 의미한다.

1.1.3 $\mathbf{C}\mathbf{M}\mathbf{R}$ 분해

선형 대수 공부를 하면서 다양한 형태의 분해를 살펴 본다. 다음 섹션 1.2에서 선형대수 대표 5가지 분해에 대하여 복습 하겠지만, 이번에는 조금 새로운 분해인 $\mathbf{C}\mathbf{M}\mathbf{R}$ 분해에 대하여 살펴보자.

$\mathbf{C}\mathbf{M}\mathbf{R}$ 에서

$\mathbf{C}$: $\mathbf{A}$의 linear independent한 column의 모임
$\mathbf{R}$: $\mathbf{A}$의 linear independent한 row의 모임
$\mathbf{M}$: $rk(\mathbf{A}) = r$일 때, $r \times r$ 크기의 $\mathbf{C}\mathbf{M}\mathbf{R} = \mathbf{A}$을 만족 시키는 정방행렬 (Mixing matrix이라 부름)

이제 $\mathbf{M}$을 구해보자. $\mathbf{C}$는 모든 행이 선형독립 (linear independent)하므로 $\mathbf{C}^\intercal\mathbf{C}$는 가역이다. 마찬가지로, $\mathbf{R}\mathbf{R}^\intercal$도 가역이다.

따라서 $\mathbf{M}$은 다음과 같다.

$\mathbf{A} = \mathbf{C}\mathbf{M}\mathbf{R}$ $\mathbf{C}^\intercal\mathbf{A}\mathbf{R}^\intercal = \mathbf{C}^\intercal\mathbf{C}\mathbf{M}\mathbf{R}\mathbf{R}^\intercal$ $\mathbf{M} = (\mathbf{C}^\intercal\mathbf{C})^{-1}\mathbf{C}^\intercal\mathbf{A}\mathbf{R}^\intercal(\mathbf{R}\mathbf{R}^\intercal)^{-1}$

Theorem 2

임의의 행렬 $\mathbf{A} \in \mathbb{R}^{n \times m}$을 다음과 같이 분해할 수 있다.
$\mathbf{A} = \mathbf{C}\mathbf{M}\mathbf{R}$ where

$\mathbf{C}$: $\mathbf{A}$의 linear independent한 column의 모임

$\mathbf{R}$: $\mathbf{A}$의 linear independent한 row의 모임

$\mathbf{M} = (\mathbf{C}^\intercal\mathbf{C})^{-1}\mathbf{C}^\intercal\mathbf{A}\mathbf{R}^\intercal(\mathbf{R}\mathbf{R}^\intercal)^{-1}$

이제 예를 통해 이를 확인하자.

Example 1

$\mathbf{A} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{pmatrix}$ 일때 $\mathbf{C}\mathbf{M}\mathbf{R}$ 꼴로 분해하자.

import numpy as np
from sympy import Matrix

npA = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
A = Matrix(npA)

RREF = A.rref()
print("RREF: \n{}".format(RREF[0]))

RREF: 
Matrix([[1, 0, -1], [0, 1, 2], [0, 0, 0], [0, 0, 0]])

npC = np.array([[1, 4, 7, 10], [2, 5, 8, 11]])
npC = npC.T
print("C: \n{}".format(npC))

C: 
[[ 1  2]
 [ 4  5]
 [ 7  8]
 [10 11]]

npR = np.array([[1, 2, 3], [4, 5, 6]])

print("R: \n{}".format(npR))

R: 
[[1 2 3]
 [4 5 6]]

npM = np.linalg.inv(npC.T @ npC) @ npC.T @ npA @ npR.T @ np.linalg.inv(npR @ npR.T)

print("M: \n{}".format(npM))

M: 
[[-1.66666667  0.66666667]
 [ 1.33333333 -0.33333333]]

#lets check if it works!

assumed_A = npC @ npM @ npR

print("A: \n{}".format(assumed_A))

A: 
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]]

잘 분해됨을 알 수 있다!

1.2 행렬의 곱셈 (Matrix-Matrix multiplication) $\mathbf{A}\mathbf{B}$

1.1에서 $\mathbf{A}\mathbf{x}$ 연산법의 핵심은 행렬 $\mathbf{A}$의 각 원소의 곱을 생각하는 것이 아닌, $\mathbf{A}$의 열 벡터를 통해 곱셈 연산을 적용하였다. 행렬과 행렬의 곱셈에서도 이런 원리는 똑같이 적용된다.

1.2.1 일반적인 방식

먼저 행렬의 곱셈을 처음 배울 때 적용되는 방식을 살펴 보자.

행렬 $\mathbf{A} \in \mathbb{R}^{m \times n}$와 $\mathbf{B} \in \mathbb{R}^{n \times k}$의 곱셈은 $\mathbf{C} = \mathbf{A}\mathbf{B} \in \mathbb{R}^{m \times k}$라고 할 떄, 행렬 $\mathbf{C}$의 원소 $c_{ij}$는 다음과 같다.

$c_{ij} = \sum_{l = 1}^{n} a_{il} b_{lj}$ where $i = 1, \cdots, m, j = 1, \cdots, k$

손으로 직접 행렬의 곱셈 연산을 한다면 이와 같은 방식은 꽤나 효과적이다. 하지만 컴퓨테이션의 측면에서 본다면 이는 매우 비효율적이다.

1.2.2 열-행 곱셈법 (Column-row multiplication)

행렬과 행렬의 곱셈에서도 1.1.1의 방식을 사용할 수 있다.

\[\mathbf{A} \in \mathbb{R}^{m \times n} = \begin{pmatrix}\mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \end{pmatrix}\] \[\mathbf{B} \in \mathbb{R}^{n \times k} = \begin{pmatrix}\mathbf{b}_1^* \\ \mathbf{b}_2^* \\ \vdots \\ \mathbf{b}_n^* \end{pmatrix}\]

이므로 행렬의 곱은 다음과 같다.

\[\mathbf{A}\mathbf{B} = \begin{pmatrix}\mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \end{pmatrix}\begin{pmatrix}\mathbf{b}_1^* \\ \mathbf{b}_2^* \\ \vdots \\ \mathbf{b}_n^* \end{pmatrix} = \mathbf{a}_1 \mathbf{b}_1^* + \mathbf{a}_2 \mathbf{b}_2^* \cdots \mathbf{a}_n \mathbf{b}_n^*\]

이는 $m \times k$ 행렬을 $n$번 더한 값이다.

1.2.3 행렬 곱의 컴퓨테이션 시간

위 방식은 손으로 하기에는 번거로워 보일 수 있다. 하지만, 크기가 큰 행렬을 컴퓨터로 연산할 때는 column-row multiplication 방식이 훨씬 효과적이다.

방법1 (rows times columns)
연산 횟수: $n$번의 multiplication을 $mp$번 = $nmp$번의 곱셈 연산

방법2 (coulumns times rows)
연산 횟수: $mp$번의 multiplication을 $n$번 = $nmp$번의 곱셈 연산

따라서 두 방식의 컴퓨터 시간 복잡도는 비슷하다. 그렇다면 행렬 $\mathbf{A}$에 대한 직관을 쉽게 얻을 수 있는 방법2를 사용하는 것이 더 유용하다.