Dot Product is All You Need
Introduction
Most of us have heard of this concept, whether through deliberate study of books, papers, and videos, or involuntarily from friends who talk a lot about mathematics. It is usually introduced in basic linear algebra courses. Suppose you have two vectors a = [1, 2, 4] and b = [2, 5, 8]. Applying the dot product to those vectors yields the scalar value 44. The dot product itself is defined by the following convention.
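$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

The example above can be verified quickly with NumPy:

```python
import numpy as np

a = np.array([1, 2, 4])
b = np.array([2, 5, 8])

# Element-wise multiplication followed by summation: 1*2 + 2*5 + 4*8 = 44
print(np.dot(a, b))  # 44
```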
Despite its simplicity, this exquisite mathematical expression, a sum of products, has proven to be a gold standard that underpins mathematical modeling in many fields of study. Let's take a look at the closest relative of the dot product: matrix multiplication. Matrix multiplication leverages the computational aptitude of the dot product by treating the left operand as a collection of row vectors and the right operand as a collection of column vectors; each entry of the product is the dot product of one row with one column. It could be written as follows.
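$$C = AB, \qquad c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj} = \mathbf{a}_{i,:} \cdot \mathbf{b}_{:,j}$$

where $\mathbf{a}_{i,:}$ is the $i$-th row of $A$ and $\mathbf{b}_{:,j}$ is the $j$-th column of $B$.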
In information theory, entropy is the fundamental notion that measures the amount of information. It combines two values for each event: its probability p and its log-probability log p, whose negative is also described as the surprisal or information content. Entropy, (not) surprisingly (no pun intended), exploits the usefulness of the dot product: it can be seen as the negative dot product of the vector of probabilities and the vector of their logarithms.
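$$H(X) = -\sum_{i} p(x_i)\,\log p(x_i) = -\,\mathbf{p} \cdot \log \mathbf{p}$$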
The expected value can be viewed as the mean of a random variable in probability theory. It is calculated by summing the products of each value x of the random variable and its respective probability. From the definition, it is indeed a dot product. The formula is stated below.
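$$E[X] = \sum_{i} x_i\, p(x_i) = \mathbf{x} \cdot \mathbf{p}$$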
Standard deviation and covariance in statistics also make use of the dot product. Standard deviation measures the spread of the data in a distribution: σ is obtained by taking the square root of the self-dot product of the deviations of the data from their mean, followed by a normalization factor. Covariance, on the other hand, describes the relationship between the data in two distributions: it is the dot product of the deviations from the mean in one distribution with the deviations from the mean in the other, again followed by a normalization factor. These can be formulated as follows.
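Writing μ for the mean and assuming the population convention 1/N for the normalization factor (the sample convention uses 1/(N − 1) instead):

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}, \qquad \operatorname{cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu_X\right)\left(y_i - \mu_Y\right)$$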
In signal processing, the Discrete Fourier Transform (DFT) uses the dot product to convert a signal from the time domain into the frequency domain. To obtain the component of the signal at a given frequency, the discrete-time signal at each time step is multiplied by a complex rotation factor at that frequency, and the products are summed. The DFT is defined as follows [1].
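$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-i 2\pi k n / N}, \qquad k = 0, 1, \ldots, N - 1$$

Each frequency component X_k is thus the dot product of the time-domain signal with a complex sinusoid of frequency k.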
The Convolutional Neural Network (CNN) is a longstanding yardstick in Deep Learning and pattern recognition as a high-quality feature extractor for computer vision problems. It harnesses the dot product through the convolution operation, which takes an element-wise multiplication between a kernel/filter and an input (an image or a feature map), followed by a summation, at each spatial location. The name "convolution" in CNNs is misleading, though, because the kernel is never flipped; strictly speaking, the operation is cross-correlation, so CNNs should arguably be named Cross-Correlation Neural Networks. The operation is described as follows [2].
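$$S(i, j) = (I \star K)(i, j) = \sum_{m}\sum_{n} I(i + m,\, j + n)\, K(m, n)$$

where I is the input and K is the kernel.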
These facts give a peculiar and yet sensible feeling at the same time. How is this possible? How can a simple mathematical strategy that relies on nothing but multiplication and addition attain a deeper meaning in so many contexts? Is this alien knowledge, just like conspiracy theorists speculate about how pyramids were built in the same fashion across the world?
Dot Product Interpretation
The dot product answers the "how to multiply two vectors" question, alongside the cross product. Geometrically speaking, the dot product multiplies the magnitude of the orthogonal projection of one vector onto the other by the magnitude of that other vector [5]. Consider the equation of the dot product in relation to the law of cosines [6].
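$$\hat{a} \cdot \hat{y} = \|\hat{a}\|\,\|\hat{y}\|\cos\theta$$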
We can visualize the dot product as follows: place â and ŷ tail to tail with the angle 𝜃 between them, then drop a perpendicular from the tip of â onto ŷ. The segment of ŷ cut off by this perpendicular is the orthogonal projection ê of â onto ŷ, and â is the hypotenuse of the resulting right triangle.
Based on the triangle ratio, the cosine of 𝜃 is calculated as follows.
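$$\cos\theta = \frac{\|\hat{e}\|}{\|\hat{a}\|}$$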
Hence, by combining the aforementioned equations:
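$$\hat{a} \cdot \hat{y} = \|\hat{a}\|\,\|\hat{y}\|\cos\theta = \|\hat{a}\|\,\|\hat{y}\|\,\frac{\|\hat{e}\|}{\|\hat{a}\|} = \|\hat{y}\|\,\|\hat{e}\|$$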
where ||ê|| is the magnitude of the orthogonal projection of vector â onto vector ŷ. The same derivation works for the projection of ŷ onto â, although the value of ||ê|| would differ whenever â ≠ ŷ. The ||ê|| value acts as a surrogate magnitude of the projected vector in the magnitude multiplication. But what does the dot product really tell us?
As mentioned above, the equation of the dot product is closely related to the law of cosines. This fact reveals something intriguing about the role of the cosine within the dot product. The value of the cosine is zero (its middle value) when the angle is 90° or 270°, one (its maximum) when the angle is 0° or 360°, and negative one (its minimum) when the angle is 180°.
Under the dot product formula, 𝜃 is the angle between the two vectors. The smaller the angle, the more closely the vectors point in the same direction, and the higher the cosine. Hence, the cosine measures the similarity of the two vectors. In other words, applying the dot product to two vectors amounts to measuring their similarity.
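A minimal sketch in NumPy of this behavior (the example vectors are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v, normalized by their magnitudes."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])

print(cosine_similarity(a, np.array([2.0, 0.0])))   #  1.0 (same direction, 0°)
print(cosine_similarity(a, np.array([0.0, 3.0])))   #  0.0 (orthogonal, 90°)
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0 (opposite, 180°)
```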
For example, let's take another look at the covariance equation. It combines the deviations of two different distributions. Similarity in this context can be interpreted as the relationship between these two entities: are they canceling each other out (the value approaches zero), opposing each other (the value becomes negative), or reinforcing each other (the value becomes positive)?
Due to its aggregative nature, the dot product can also be considered a summary of the assorted elements. This property is exhibited in the expected value equation, which is why the expected value is considered the mean of the random variable. This summarizing process can also be considered feature extraction, because the mean is a characteristic of the distribution (think back to matrix multiplication and CNNs).
Dot Product Application
One application of the dot product is image similarity. In ontology, an object tends to be described by its inherent abstract properties. This idea can be brought over to the computer vision context.
An image is a collection of pixels containing numbers. These numbers carry information, such as the objects in the image or abstract noise. Using Deep Learning, we can construct an image descriptor in the form of a vector. Deep Learning leverages the strength of the dot product through matrix multiplication and convolution operations, which act as feature extractors.
Suppose we have two images, A and B. Image A contains a cat and image B contains a dog. We feed each image to a deep learning model to obtain its vector descriptor. Once again we borrow the power of the dot product to measure the similarity of the images; in this case, the similarity score should be low, near zero or even negative, since the images have different properties. We can further normalize the score into a zero-to-one range so that it behaves more like a probability value.
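A minimal sketch of this pipeline, assuming the descriptors come from some deep model; here random vectors stand in for the real descriptors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v, normalized by their magnitudes."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# In practice these would come from a deep model, e.g. desc_a = model(image_a);
# random 512-dimensional vectors are used here only as placeholders.
rng = np.random.default_rng(0)
desc_a = rng.standard_normal(512)  # descriptor of image A (cat)
desc_b = rng.standard_normal(512)  # descriptor of image B (dog)

score = cosine_similarity(desc_a, desc_b)  # in [-1, 1]
normalized = (score + 1.0) / 2.0           # rescaled to [0, 1]
print(f"similarity: {score:.3f}, normalized: {normalized:.3f}")
```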
Summary
Despite its simplicity, the dot product appears in a wide variety of mathematical formulas. Fundamentally a similarity gauge, it also admits several interpretations: a measure of relationship, a summarizer, and a feature extractor. The dot product can be used in image similarity, both for building the image descriptor and for the matching process.
References
[1] https://brilliant.org/wiki/discrete-fourier-transform/
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, Massachusetts: MIT Press, 2017.
[3] https://www.ibm.com/cloud/learn/convolutional-neural-networks
[4] https://twitter.com/elonmusk/status/1289051795763769345
[5] https://math.stackexchange.com/questions/805954/what-does-the-dot-product-of-two-vectors-represent
[6] https://klein.mit.edu/~djk/18_01/chapter05/proof05.html