My Blog

For High School Students (and Students at Heart): Sums of Powers of Positive Integers

6 July 2020

The problem of finding sums of powers of positive integers, in its general form, is to find a closed-form formula for \[ \sum_{i = 1}^{n} i^k \] where \(k\) is a nonnegative integer. Many puzzles have been posed around this simple-to-state problem. Typically, the parameter that sets the level of difficulty of such puzzles—perhaps monotonically—is the exponent \(k\).

In the simplest case, \(k = 0\), the problem reduces to an exercise in understanding the summation notation \(\sum\), as it may not be clear to a novice math learner that \[ \sum_{i=1}^{n} 1 = n \] since the summand is seemingly independent of the counting index \(i\).

The next puzzle, and perhaps the most famous one, is the case of \(k = 1\). For many, this is the first time they hear the name Gauss. The famous closed-form solution in this case is \[ \sum_{i = 1}^n i= \frac{n(n+1)}{2}. \] See here for proofs!

When \(k>1\), the dynamics of the problem are different. A typical high-school-level solution involves a telescoping summation of the differences between consecutive \((k+1)^{\text{th}}\) powers of positive integers. See here for a proof of this kind for the case of \(k=2\). Using such methods, many of us learned that the following formulas are handy, and perhaps it is not a terrible idea to memorize them (or their proof technique): \[ \sum_{i=1}^{n} 1 = n \] \[ \sum_{i=1}^{n} i = \frac{n(n+1)}{2} \] \[ \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6} \] \[ \sum_{i=1}^{n} i^3 = \left(\frac{n(n+1)}{2}\right)^2 \] and of course \(k>3\) is usually left for personal practice, as it may not be easy to find insightful solutions. In fact, there is a long history behind this problem (see here for a nice review). The result is of course well known to a mathematician as Faulhaber's formula or sometimes Bernoulli's formula (which Bernoulli?). A quick search on the Internet reveals that the techniques used for the general problem are usually too advanced for a high school student (or a math enthusiast). This somewhat contradicts what Arthur Engel suggests: a good problem in combinatorics is one in which a high school student is not at a disadvantage compared to a professional mathematician.

In this post, we try to find a simple solution (in the sense of required background) to the general form of the problem. It is beneficial to begin with the well-known cases of \(k = 1\) and \(k=2\): the case of \(k=1\) introduces the idea, and the case of \(k=2\) shows its power!

The main technique is, of course, double counting. We construct an auxiliary problem with two solutions: an easy one, whose answer is \[ \sum_{i=1}^n i^k, \] and a second one, whose answer is a formula in terms of \(n\) and \(k\).

Consider the following auxiliary problem (all variables are positive integers): Find the number of solutions \((x,y)\) of the inequality \[ x\leq y \leq n \] One way to find the number of \((x,y)\) pairs that satisfy the above inequality is to consider the following cases: If \(y = n\), there are \(n\) possible choices for \(x\), namely \[ x\in \{1, 2, \dots, n\}. \] What if \(y = n-1\)? There are \(n-1\) choices for \(x\), and so on. So, the total number of admissible \((x,y)\) pairs is \[ 1+2+\dots +n = \sum_{i=1}^n i \] which is the summation we aimed for.

A second way of counting the number of solutions of \[ x\leq y \leq n \] is to use our (assumed-to-be-limited) knowledge of combinatorics! Consider two cases:

  1. if \(x\neq y\), we choose \(2\) numbers from \(1,2,\dots, n\) and assign the larger one to \(y\) and the smaller one to \(x\). There are \(\binom{n}{2}\) different ways to do so;
  2. if \(x = y\), there are of course \(n\) possibilities.
From this, the number of solutions is \( \binom{n}{2} + n. \) This establishes the identity \[ \sum_{i=1}^n i = \binom{n}{2} + n = \frac{n(n+1)}{2}. \]
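For the skeptical reader, here is a quick numerical check of this identity (a minimal Python sketch, not part of the argument):

```python
from math import comb

# Verify sum_{i=1}^{n} i == C(n, 2) + n == n(n+1)/2 for small n.
for n in range(1, 100):
    assert sum(range(1, n + 1)) == comb(n, 2) + n == n * (n + 1) // 2
print("identity holds for n = 1..99")
```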

Next stop is the case of \(k=2\), i.e., sums of squares. The auxiliary problem here is to find the number of admissible solutions \((x_1, x_2, y)\) of the inequality \[ x_1, x_2 \leq y \leq n. \] Note that \(x_1\) and \(x_2\) are not ordered: we could have either \(x_1\leq x_2\) or \(x_1>x_2\).

The first way to count the admissible triplets is to consider the following cases: if \(y = n\), there are \(n^2\) possible choices for \((x_1, x_2)\), namely \[ (x_1, x_2)\in \{1, 2, \dots, n\}^2. \] As we reduce \(y\), it becomes apparent that there are \[ \sum_{i=1}^n i^2 \] solutions to our auxiliary problem (at this point, one should see how to generalize the auxiliary problem!).

The first solution gives us the summation, and the second solution must give the formula! The set \(\{x_1, x_2, y\}\) can have various cardinalities. We divide the problem based on \[ \lvert\{x_1, x_2, y\}\rvert: \]

  • if the cardinality is \(3\), no two of the three numbers \(x_1\), \(x_2\), and \(y\) are equal. To make such a selection, we choose \(3\) numbers from \(1,2,\dots, n\) and assign the largest one to \(y\). We can assign the remaining \(2\) to \(x_1\) and \(x_2\) in \(2\) ways. Hence, there are \(2\binom{n}{3}\) different ways to do so;
  • if the cardinality is \(2\), the situation is one of the following three \[ x_1 = x_2 \neq y,\quad x_1 = y \neq x_2,\quad x_1\neq y = x_2 \] and therefore there are \(3\binom{n}{2}\) ways;
  • if the cardinality is \(1\), i.e., \[ x_1 = x_2 = y, \] there are \(\binom{n}{1}\) ways to choose the triplet.
If we add them up, we get \[ \sum_{i=1}^{n}i^2 = 2\binom{n}{3} + 3\binom{n}{2} + \binom{n}{1} = \frac{n(n+1)(2n+1)}{6}. \]

For the general case, our auxiliary problem is to find the number of admissible tuples \((x_1,x_2, \dots, x_k, y)\) for \[ x_1, x_2, x_3, \dots, x_k \leq y \leq n. \] The first solution gives the summation \[ \sum_{i=1}^ni^k. \] For the second solution, the main step is to understand how to count the admissible tuples based on the cardinality argument: for the auxiliary problem \[ x_1, x_2, x_3, \dots, x_k \leq y \leq n \] we want to find the number of admissible tuples \((x_1, x_2, \dots, x_k, y)\) such that \[ \lvert\{x_1, x_2, \dots, x_k, y\}\rvert = l, \quad 1\leq l \leq k+1. \] Consider having \(l\) boxes. Pick \(l\) distinct numbers from \(1,2,\dots, n\), in \(\binom{n}{l}\) ways, to label the boxes, and put \(y\) in the box with the largest label. To count the number of possible ways to place \(x_1, x_2, \dots, x_k\) in the boxes, we use the inclusion-exclusion principle. Each \(x_i\) can go into any of the \(l\) boxes, giving \(l^k\) possibilities in total, from which we should subtract those placements that leave some of the \(l-1\) remaining boxes empty (the box holding \(y\) is already occupied). The inclusion-exclusion principle gives the total number of possibilities \[ \binom{n}{l}\sum_{m = 0}^{l-1}(-1)^m(l-m)^k\binom{l-1}{m}. \]

The total number of admissible tuples can be obtained by summing over all possible cardinalities \(l\): \[ \sum_{i=1}^ni^k = \sum_{l = 1}^{k+1}\binom{n}{l}\sum_{m = 0}^{l-1}(-1)^m(l-m)^k\binom{l-1}{m}. \]
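The identity is easy to test numerically. Below is a minimal Python sketch that implements the right-hand side and compares it against the brute-force sum (this also covers the special cases \(k=1\) and \(k=2\) worked out above):

```python
from math import comb

def power_sum(n: int, k: int) -> int:
    """Sum of i**k for i = 1..n via the double-counting identity."""
    total = 0
    for l in range(1, k + 2):  # cardinality of {x_1, ..., x_k, y}
        # Ways to place x_1..x_k so that all l - 1 non-largest labels
        # are used at least once (inclusion-exclusion):
        placements = sum((-1) ** m * comb(l - 1, m) * (l - m) ** k
                         for m in range(l))
        total += comb(n, l) * placements
    return total

for k in range(6):
    for n in range(1, 30):
        assert power_sum(n, k) == sum(i ** k for i in range(1, n + 1))
print("identity verified for k = 0..5, n = 1..29")
```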


Variational Methods: A Theorem on Channel Capacity

19 March 2020

Consider a communication channel with input random variable \(X\) and output random variable \(Y\). Assume both \(X\) and \(Y\) are continuous random variables with sufficiently smooth distributions \(f_X(\cdot)\) and \(f_Y(\cdot)\). Also, denote the channel law, that is, the conditional distribution of the output \(Y\) given the input \(X\), by \(W(y| x)\). The channel capacity can be found by solving the following optimization problem: \[ C = \max_{f_X(\cdot)} I(X;Y) \] with \(I(X;Y)\) being the mutual information between \(X\) and \(Y\). Let us reformulate this problem in terms of the input distribution: \[ C = \max_{f_X(\cdot)} \iint {f_X(x)W(y|x)\log\left(\frac{W(y|x)}{\int{f_X(z)W(y|z)}\mathit{dz}}\right)} \mathit{dx}\mathit{dy}. \] Assume \(f^*(\cdot)\) is a maximizer of this problem, inducing the output distribution \(g^*(y)\) and the mutual information \(I^*\). Then any perturbed input distribution \[ f_X(x) = f^*(x) + \epsilon J(x), \] with \(\epsilon\) approaching zero, should give vanishing first-order changes in the mutual information. Note, for \(J(x)\) to be an admissible perturbation, we must have \[ \int J(x) \mathit{dx} = 0 \] and \[ \forall x: f^*(x) + \epsilon J(x) \geq 0. \] Let us find the changes in the mutual information up to first order in \(\epsilon\). Using the first-order Taylor expansion of \(\log(\cdot)\), we have \[ I(X;Y) - I^* = \epsilon\left[\iint J(x)W(y|x)\log\left(W(y|x)\right)\mathit{dx}\mathit{dy} - \iiint \frac{f^*(x)W(y|x)W(y|z)}{g^*(y)}J(z)\mathit{dx}\mathit{dy}\mathit{dz} - \iint J(x)W(y|x)\log\left(g^*(y)\right)\mathit{dx}\mathit{dy}\right] + o(\epsilon). \] The term in the square brackets is the variation we care about. For a maximizer distribution \(f^*(\cdot)\), this term must vanish. The triple integral is in fact zero: integrating over \(x\) turns \(f^*(x)W(y|x)\) into \(g^*(y)\), which cancels the denominator, and what remains is \(\iint W(y|z)J(z)\mathit{dy}\mathit{dz} = \int J(z)\mathit{dz} = 0\). After some algebra, this means that \[ \int J(x)D_{\text{KL}}\left(W(y|x)||g^*(y)\right)\mathit{dx} = 0 \] with \(D_{\text{KL}}\left(\cdot||\cdot\right)\) denoting the KL divergence. This identity must hold for all admissible perturbations \(J(x)\). Since admissible perturbations are constrained only by \(\int J(x)\mathit{dx} = 0\), the divergence \(D_{\text{KL}}\left(W(y|x)||g^*(y)\right)\) must be constant wherever \(f^*(x) > 0\); multiplying by \(f^*(x)\) and integrating shows that this constant equals \(I^* = C\). The fundamental lemma of calculus of variations then gives the following theorem:

Theorem: For a compactly supported input random variable \(X\), a necessary condition for the input distribution to be capacity achieving is \[ C = D_{\text{KL}}\left(W(y|x)||g^*(y)\right),\quad \forall x\,\, \text{with}\,\, f^*(x) > 0. \]
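Although the theorem above is stated for continuous channels, its discrete analog is the familiar condition that \(D_{\text{KL}}\left(W(\cdot|x)||g^*\right) = C\) for every input letter used with positive probability. Here is a minimal Python sketch that checks this numerically with the Blahut-Arimoto algorithm, assuming a channel matrix with strictly positive entries; the matrix itself is made up for illustration:

```python
import numpy as np

def blahut_arimoto(W, iters=2000):
    """Capacity of a DMC with transition matrix W[x, y] (rows sum to 1)."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])  # start from uniform input
    for _ in range(iters):
        g = p @ W                              # induced output distribution
        D = np.sum(W * np.log(W / g), axis=1)  # D_KL(W(.|x) || g) per input
        p = p * np.exp(D)                      # multiplicative update
        p /= p.sum()
    g = p @ W
    D = np.sum(W * np.log(W / g), axis=1)
    return p @ D, p, D  # capacity (nats), optimal input, divergences

# A made-up binary-input, ternary-output channel.
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
C, p_star, D = blahut_arimoto(W)
print("capacity:", C)
print("D_KL per input letter:", D)  # both entries approach C since p*(x) > 0
```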


Support The TERRY FOX FOUNDATION

11 September 2018

This year I am proudly supporting The Terry Fox Foundation in its ongoing work to fund innovative and progressive cancer research programs by taking part in The Terry Fox Run!

I very much hope you will consider sponsoring me in support of my effort, confident in the knowledge that your kindness will impact the lives of so many people living with cancer.

SPONSOR ME


A Practice on Inequalities

16 May 2018

I was reading the seminal paper of Arikan when I saw an inequality bounding the symmetric information rate of a binary-input discrete memoryless channel (BIDMC) from below by a function of the Bhattacharyya parameter of the channel. I thought it might be interesting to prove it by means of elementary inequalities. It was a great exercise! Let me share some background before diving into the inequality problem.

Let \(W:\mathcal{X}\to\mathcal{Y}\) be a BIDMC (so \(\mathcal{X} = \{0,1\}\)) with channel law \(W(\cdot|\cdot)\); that is, the conditional probability of the output given the input is \(W(\cdot|\cdot)\). Let \(W^n:\mathcal{X}^n\to \mathcal{Y}^n\) represent \(n\) parallel uses of \(W\), with the following channel law \[ W^n(y_1^n|x_1^n) = \prod_{i=1}^n W(y_i|x_i)\,. \]

There are two parameters that play an important role in characterizing the data rate and reliability when transmitting data over \(W\). One is the symmetric information rate \[ I(W) = \left.I(X;Y)\right|_{X\sim\mathcal{B}(\frac{1}{2})} = \sum_{y\in \mathcal{Y}}{\sum_{x\in \mathcal{X}}\frac{1}{2}W(y|x)\log\left(\frac{W(y|x)}{\frac{1}{2}W(y|0) + \frac{1}{2}W(y|1)}\right)} \] that gives a measure of rate. This is equal to the channel capacity whenever \(W\) is a symmetric channel. The other parameter is the so-called Bhattacharyya parameter, defined by \[ Z(W) = \sum_{y\in\mathcal{Y}}\sqrt{W(y|0)W(y|1)} \] which is an upper bound on twice the error probability when using a maximum likelihood decoder at the receiver side of \(W\): \[ P_e = \sum_{W(y|1)\geq W(y|0)}{\frac{1}{2}W(y|0)} + \sum_{W(y|0)>W(y|1)}{\frac{1}{2}W(y|1)} = \sum_{y\in\mathcal{Y}} \frac{1}{2}\min \{W(y|0), W(y|1)\} \leq \frac{1}{2}Z(W)\,, \] where the last step uses the fact that the geometric mean of two nonnegative numbers is not less than their minimum. Note also that if the output random variable is continuous, the corresponding summations are replaced by integrations.
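To make these definitions concrete, here is a minimal Python sketch (the crossover probability \(0.11\) is an arbitrary choice) that computes \(I(W)\), \(Z(W)\), and \(P_e\) for a binary symmetric channel and confirms \(P_e \leq Z(W)/2\):

```python
import numpy as np

def bsc(eps):
    """Channel law of a binary symmetric channel: W[x][y]."""
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

def I(W):
    """Symmetric information rate in bits (uniform input)."""
    g = 0.5 * W.sum(axis=0)  # output distribution under uniform input
    return float(np.sum(0.5 * W * np.log2(W / g)))

def Z(W):
    """Bhattacharyya parameter."""
    return float(np.sum(np.sqrt(W[0] * W[1])))

def Pe(W):
    """ML decoding error probability (uniform input)."""
    return float(np.sum(0.5 * np.minimum(W[0], W[1])))

W = bsc(0.11)
print(f"I(W) = {I(W):.4f}, Z(W) = {Z(W):.4f}, Pe = {Pe(W):.4f}")
assert Pe(W) <= Z(W) / 2
```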

The importance of using the Bhattacharyya parameter as a measure of reliability is twofold. First, it has very nice analytic properties. Second, we will see that \[ I(W) = 0 \iff Z(W) = 1\,, \] and \[ I(W) = 1 \iff Z(W) = 0\,. \] That is, \(Z(W)\) gives a tight estimate of the error probability when the channel is either almost perfect or almost useless. To verify this claim, the following inequality needs to be proved (it gives one implication in each equivalence; the reverse implications follow from the well-known upper bound \(I(W)\leq \sqrt{1-Z(W)^2}\)): \[ I(W)\geq \log\left(\frac{2}{1+Z(W)}\right)\,. \] My proof takes the following steps: \[ \begin{split} I(W) &= \sum_{y\in \mathcal{Y}}{\sum_{x\in \mathcal{X}}\frac{1}{2}W(y|x)\log\left(\frac{W(y|x)}{\frac{1}{2}W(y|0) + \frac{1}{2}W(y|1)}\right)}\\ &=\sum_{y\in \mathcal{Y}}{\sum_{x\in \mathcal{X}}W(y|x)\log\left(\frac{W(y|x)}{\sqrt{W(y|x)\frac{W(y|0) + W(y|1)}{2}}}\right)} \\ &=\sum_{y\in \mathcal{Y}}{W(y|0)\log\left(\frac{W(y|0)}{\sqrt{W(y|0)\frac{W(y|0) + W(y|1)}{2}}}\right)} + \sum_{y\in \mathcal{Y}}{W(y|1)\log\left(\frac{W(y|1)}{\sqrt{W(y|1)\frac{W(y|0) + W(y|1)}{2}}}\right)}\\ &\geq \log\left(\frac{1}{\sum_{y\in \mathcal{Y}}{\sqrt{W(y|0)\frac{W(y|0) + W(y|1)}{2}}}}\right) + \log\left(\frac{1}{\sum_{y\in \mathcal{Y}}{\sqrt{W(y|1)\frac{W(y|0) + W(y|1)}{2}}}}\right)\quad \text{log-sum inequality}\\ &=\log\left(\frac{1}{\prod_{x\in \mathcal{X}}{\sum_{y\in \mathcal{Y}}{\sqrt{W(y|x)\frac{W(y|0) + W(y|1)}{2}}}}}\right)\\ &\geq 2\log\left(\frac{1}{\sum_{y\in \mathcal{Y}}{\frac{\sqrt{W(y|0)} + \sqrt{W(y|1)}}{2}\sqrt{\frac{W(y|0) + W(y|1)}{2}}}}\right)\quad \text{AM-GM inequality}\\ &\geq \log\left(\frac{1}{\sum_{y\in \mathcal{Y}}{\left(\frac{\sqrt{W(y|0)}+\sqrt{W(y|1)}}{2}\right)^2}}\right) + \log\left(\frac{1}{\sum_{y\in \mathcal{Y}}{\frac{W(y|0) + W(y|1)}{2}}}\right)\quad \text{Cauchy–Schwarz inequality}\\ & = \log\left(\frac{1}{\sum_{y\in \mathcal{Y}}{\left(\frac{\sqrt{W(y|0)}+\sqrt{W(y|1)}}{2}\right)^2}}\right) \\ & = \log\left(\frac{2}{1+Z(W)}\right)\,. \end{split} \]
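The chain of inequalities above can also be sanity-checked numerically. Here is a small Python sketch that tests the bound (with logarithms in base \(2\), so that \(I(W)\leq 1\)) on randomly generated binary-input channels:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10_000):
    # A random binary-input channel with four outputs.
    W = rng.random((2, 4)) + 1e-9
    W /= W.sum(axis=1, keepdims=True)  # rows are conditional pmfs
    g = 0.5 * W.sum(axis=0)            # output pmf under uniform input
    I = np.sum(0.5 * W * np.log2(W / g))
    Z = np.sum(np.sqrt(W[0] * W[1]))
    assert I >= np.log2(2 / (1 + Z)) - 1e-12
print("I(W) >= log2(2 / (1 + Z(W))) held in all trials")
```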

I finish with an extension of the log-sum inequality.

Theorem: Let \(a_i\) and \(b_i\) be positive numbers for \(i = 1,\dots, n\). Then, for any \(p\in(0,1]\) \[ \sum_{i=1}^n {a_i\log\left(\frac{a_i}{b_i}\right)}\geq \frac{1}{p}\left(\sum_{i=1}^n a_i\right)\log\left(\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n a_i\left(\frac{b_i}{a_i}\right)^p}\right)\,. \]

The proof just uses the conventional log-sum inequality: simply consider \(a_i(b_i/a_i)^p\) as the new \(b_i\)! Most importantly, the right-hand side of the inequality above is greater than (or equal to) the right-hand side of the conventional log-sum inequality. That is, \[ \sum_{i=1}^n {a_i\log\left(\frac{a_i}{b_i}\right)}\geq \frac{1}{p}\left(\sum_{i=1}^n a_i\right)\log\left(\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n a_i\left(\frac{b_i}{a_i}\right)^p}\right)\geq \left(\sum_{i=1}^n a_i\right)\log\left(\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}\right) \,. \] If you wonder why this last one holds, you may resort to Hölder's inequality.
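If you would like numerical evidence before chasing Hölder, here is a small Python sketch that tests both inequalities on randomly generated positive numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(10_000):
    a = rng.random(5) + 1e-6
    b = rng.random(5) + 1e-6
    p = rng.uniform(1e-3, 1.0)
    lhs = np.sum(a * np.log(a / b))
    mid = (1 / p) * a.sum() * np.log(a.sum() / np.sum(a * (b / a) ** p))
    rhs = a.sum() * np.log(a.sum() / b.sum())
    assert lhs >= mid - 1e-9   # extended log-sum inequality
    assert mid >= rhs - 1e-9   # extended bound is at least as tight
print("both inequalities held in all trials")
```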


Just because ...

19 October 2017

One good thing about photography is that there is no need for explanation.

October Collection


Background Pictures: Night Photography

21 September 2017

I have been playing around with my new DSLR and I decided to post some of my first few long-exposure shots. Feel free to use them as your background picture!

Night Photography


From "Number Theory" of Maryam Mirzakhani

15 July 2017

Suppose \(p(x,y)\) is a bivariate polynomial. The set of points that solve the equation \(p(x,y) = 0\), denoted \(\mathcal{C}_p(\mathbb{R})\), forms a curve in the \(xy\)-plane. Note \(\mathcal{C}_p(\mathbb{R})\) might be empty, e.g., if \(p(x,y) = x^2 + y^2 + 1\). A point \((x,y)\in \mathcal{C}_p(\mathbb{R})\) is a rational point of the curve \(\mathcal{C}_p(\mathbb{R})\) if \(x\in \mathbb{Q}\) and \(y\in \mathbb{Q}\). The set of all rational points of \(\mathcal{C}_p(\mathbb{R})\) is denoted by \(\mathcal{C}_p(\mathbb{Q})\). Note that even if \(\mathcal{C}_p(\mathbb{R})\) is not empty, \(\mathcal{C}_p(\mathbb{Q})\) may be empty. For instance, if \(p(x,y) = x^2 + y^2 - 3\), then \(\mathcal{C}_p(\mathbb{Q})\) is empty (can you prove that?).

Now, assume \((x_0, y_0)\) and \((x_1,y_1)\) are two distinct points in \(\mathcal{C}_p(\mathbb{Q})\). The slope of the line passing through these two points is also a rational number. This immediately shows that all of the rational points of \(\mathcal{C}_p(\mathbb{R})\) can be found by intersecting \(\mathcal{C}_p(\mathbb{R})\) with all lines of rational slope that pass through \((x_0,y_0)\). This is not in general an easy problem, but when \(p(x, y)\) has degree not greater than \(2\), it is an easy task. Although restricted, such polynomials include all conic sections. The procedure for finding the rational points of a conic section is illustrated by an example.

Example: Solve \(x^2 + 5y^2 = 1\) in \(\mathbb{Q}\).

Let \(p(x,y) = x^2 + 5y^2 - 1\). We want to find \(\mathcal{C}_p(\mathbb{Q})\). It is easy to see that \((x_0,y_0)=(1,0)\) is in \(\mathcal{C}_p(\mathbb{Q})\). If \((x_1, y_1)\) is any other point in \(\mathcal{C}_p(\mathbb{Q})\), the slope of the line passing through \((x_0, y_0)\) and \((x_1,y_1)\) is \[m = \frac{y_1 - y_0}{x_1 - x_0} = \frac{y_1}{x_1 - 1}.\] The equation of the line passing through \((x_0,y_0)\) with slope \(m\) is \[y = m(x-1).\] To find the intersection of these lines with \(\mathcal{C}_p(\mathbb{R})\), substitute \(y = m(x-1)\) in \(p(x,y) = 0\). Doing that, we get \[\begin{align} 0 &= x^2 + 5y^2 - 1 = x^2 + 5(m^2(x-1)^2) - 1\\ &= (5m^2+1)x^2 - 10m^2x + (5m^2 - 1) \end{align} \] which is a quadratic equation in terms of \(x\). Note \(x = x_0 = 1\) is always a solution. To find the other solution, note that \[ (5m^2+1)x^2 - 10m^2x + (5m^2 - 1) = (x-1)\left((5m^2 + 1)x - (5m^2 - 1)\right). \] Thus, \[ x = \frac{5m^2 - 1}{5m^2 + 1}, \] \[ y = \frac{-2m}{5m^2 + 1}, \] form a rational point. To make it more specific, let \[m = \frac{r}{s}\] where \(r\) and \(s\) are two coprime integers. Then \[ x = \frac{5r^2 - s^2}{5r^2 + s^2}, \] \[ y = \frac{-2rs}{5r^2 + s^2}. \] Therefore \(\mathcal{C}_p(\mathbb{Q})\) is exactly the set of all such \((x,y)\), together with the starting point \((1,0)\) itself.
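Here is a small Python sketch that generates points from this parameterization with exact rational arithmetic and confirms they lie on the curve:

```python
from fractions import Fraction
from math import gcd

def rational_point(r: int, s: int):
    """Rational point on x^2 + 5y^2 = 1 from the slope m = r/s."""
    d = 5 * r * r + s * s
    return Fraction(5 * r * r - s * s, d), Fraction(-2 * r * s, d)

for r in range(-5, 6):
    for s in range(1, 6):
        if gcd(r, s) != 1:
            continue  # keep r and s coprime
        x, y = rational_point(r, s)
        assert x * x + 5 * y * y == 1
print("all generated points lie on x^2 + 5y^2 = 1")
```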

Elegance is not required!

7 July 2017

Hello world! This is my first "blog" post, and I want to be brief. I ran into a very interesting theorem about convergence that immediately wiped out a problem all of my pedantic teachers wanted me to be careful about. I thought it was so cool, and a great candidate for the topic of my first post. Here is the theorem:

Theorem: Let \(u(\epsilon):(0, \infty)\to (0, \infty)\) be a function such that \(u(\epsilon) \to 0\) as \(\epsilon \to 0\). Let \(f:\mathbb{R}\to\mathbb{R}\) be a given function. Then the following two statements are equivalent.
  1. For all \(\epsilon>0\), there exists a \(\delta>0\) such that when \(|x-x_0|<\delta\), then \(|f(x) - f(x_0)|< u(\epsilon)\).
  2. For all \(\epsilon>0\), there exists a \(\delta>0\) such that when \(|x-x_0|<\delta\), then \(|f(x) - f(x_0)|< \epsilon\).

The proof is easy, very easy! The idea is that for any \(\epsilon\), you can find some \(\epsilon'\) such that \(u(\epsilon')<\epsilon\). This is true because \(u(\epsilon) \to 0\) as \(\epsilon \to 0\).
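For instance, to prove that \(f(x) = x^2\) is continuous at \(x_0\), take \(\delta = \min(1, \epsilon)\); then \(|x - x_0| < \delta\) gives \[ |x^2 - x_0^2| = |x - x_0|\,|x + x_0| < \epsilon\,(2|x_0| + 1), \] which is exactly statement 1 with \(u(\epsilon) = (2|x_0|+1)\,\epsilon\). By the theorem, this already proves continuity.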
There is no need to do cumbersome math to end up with a clean \(\epsilon\) on the right-hand side of the inequalities when you want to prove a limit!