Big Number Function - Probability Question

Jesus Barron Barajas - An old Problem Revisted

Posted on [WIP]

note: There is likely to be a lot of typos and poor grammer, this post it yet to be proofread.

Introduction

I think about 3 or more years ago I wanted to make a function or a process that would recieve a very small input and ouput a very large number, I think I called the project Big Number Function (BNF). I want to revist that idea, for a little bit of fun. The question I end landing on was this:

Given a random string of numbers and an integer (m) and probability (p), what is the number of digits (n) you would have to go through such that (m) is repeated (m) time with (p) probability.

For an example if you were given the integer 3 and probablitiy 0.50, then how many digits would you have to go through such that 3 repeats 3 times or (333) time with a 0.50 chance. As you might imagine the output number (n) will start to become quite big for a 0.50 chance once you get to larger numbers for example the digit ten would require the sequence (10101010101010101010).

now this isn't the final question and equation that I solved for and came up with, we'll get to that but first we have to solve this problem. Let's start by setting some assumtions, first the given string of digital is completely random, this means that all digits (0,1,2,3....9) all have the same likelyhood to occur (1/10 or 0.10). This is really good because it means it that the actual numbers in the sequence don't matter only the length this means that (0001) and (9999) have the same likelyhood to happen. This makes the problem a lot easier to solve.

Brainstorming

Now that we set have a question and our assumption let's try and think about what this equation would look like. Well first if we are given the integer 0 as our input m then we know that n will also be zero, since m would not have to repeat at all. next we know that when ($p \rightarrow 1.0$) the output n will approach infinity since it's a random sequence in which the sequence may appear after an infinte amount of digits.

We also know that when given the probability is 0.0 then the output n should also be zero since any other amount of digits would improve the prbability. With that in mind we also know that any output of n for p greater then 0.0 and a non-zero m should be at least m. For m to repeat m times we must atleast look at m digits, for ($m = 3$) we need at least 3 digits to get (333). however that isn't always true in fact for integers that are more then 1 digit the base case is increases. For the lowest double digit integer 10 we would need at least $n = 20$ in other to get (10101010101010101010). So we can generalize the case such that for m integer we need at least m $\cdot$ (the number of digits in m).

So that is our first problem we need to solve what is the lower bound for any integer m. from here on out the sections will broken down into the step to get to the final solution. once we solve this first initial problem we can then move forward to the more complex problem which will hopefully give a big number for a small input.

Lower Bound Issue

As we saw in the last section we know the equation for the lower bound of any integer m which is (m * number of digits in m), we now need to find a way to get the number of digits in m. This is actually not as tricky as one would think, first we know that for numbers (0-9) we should get 1, for number between (10-99) we should get 2, (100-999) 3, and so on. now is there is a function that already does this? Well hopefully alarms are going off in your head if, in which case you are thinking about the log base 10 function. well not exactly we know for digits between 1-10 we get numbers between 0-1, and for digits between 10 - 100 we get a number between 1-2. we can use this property to get the number of digits in our our integer m.

First however I wanna talk a bit about why we see this, ther are a lot of ways to explain it but I think a good way is to consider how we count numbers. We use a base ten method of counting numbers this means that for any digits there are 10 possible values that go there, (0-9) we also know that any integer (ijk) means that we have $i\times 10^2$ plus $j\times10^1$ plus $k\times10^0$. For example 243 means we have ($2 \times 10^2 = 200$), ($4 \times 10^1 = 40$), and ($3 \times 10^0 = 3$), which becomes ($200 + 40 + 3 = 243$). so as you can see the way we represent number is tie directly to the power of ten they are multiplied by in the position in which they are order.

Okay back on task now we can say that the number of digits for any positive integer greater then 1 is given by the equation ($\lfloor1 + \log(m)\rfloor$). Floor function will round down any decimal that it is given for example 0.25 will become 0.0, 4.52 will become 4.0 and so on. As we saw eariler the log function will roughly spit out the number of digits, which we can modify by adding one and using the floor function. Now that we have an idea of what the lower bound is let's try to walk through an example inputs and see how the function should work.

Example Test

So, let's try to find the answer to a simple example and let's see if we can find a pattern that we can use to solve for the general case. Let's say we are given the values ($m = 3$, $p = 0.25$), this means we want to find the sequence 333 in the random string with a probability of roughly 25 percent. one approach is trying to find the probability that we get the sequence after seeing n digits. that is let's make a table of n and the probability we see the sequence, let's the name the event of having the sequence $P(A)$. First we know that becuase of the lower bound the probability for values of n (0,1,2) will all be zero. so our first stop is $n=3$. Here the probability that we get (333) is gonna be the product of the indiviual digits or ($\frac{1}{10}\cdot\frac{1}{10}\cdot\frac{1}{10}=\frac{1}{10^3}=0.001$.

For the next digits 4 we get into a problem there are two location where the sequence (333) can appear it can either be in the first three digits or the last three $333\_ \text{ or } \_333$. In this case let's call each for those events $P(A_1)$ and $P(A_2)$ respectively, thus for the total probability we get $P(A_t)=P(A_1)\cup P(A_2)$. Because our cases are not indepent we can say the the probability become $P(A_t) = P(A_1) + P(A_2)-P(A_1)\cap P(A_2)$. remembering that both event have the same likely hood to happen of 0.001 the equation becomes ($0.001 + 0.001 - 0.001\cdot 0.001$), this comes out to be $P(A_t) = 0.001999$.

but for this process will become more complex for high values of n for example the next number, 5 digits will be even more complex in this format. we have the probably of $333\_\_ \text{ or } \_333\_ \text{ or }\_\_333$. putting it in the format for the previous example we get the following $P(A_t) = P(A_1)\cup P(A_2)\cup P(A_3)$ but we already found the union of the previous problem so it become $P(A_t) = 0.001999 + 0.001 - 0.001\cdot 0.001999$, this probability comes out to be 0.002997001, this is not a bad way to solve but there is an easier approach.

In this problem it is easier to think of the probabilty that the sequence will not occur then using the compliment rule we can solve for the probability that it will happen, which is expressed $P(A_t) = P(\bar{A_t})$. but in this case what is $P(\bar{A_t})$ we for each event $A$ the probability that the sequence is not there is $1-0.001=0.999$ therefore we need $P(\bar{A_1})\cap P(\bar{A_2}) \cap P(\bar{A_3})$ which the product of the three $0.999\cdot 0.999\cdot 0.999$, then we just subtract that value from one to solve for the probability that $A_t$ does occur, $1-0.999^3=0.002997001$. however this method is much easier to solve.

Generalizing Results

Taking what we saw from the previous section we can write a the following equation for the probabilty that the sequence (333) apears after 5 digits to be $P(A_t)=1-(1-\frac{1}{10^3})^3$. in this case the $\frac{1}{10^3}$ is from the lenght of the sequence which we know we can express as $d(m)=m\cdot\lfloor1+\log(m)\rfloor$, where m is the positive integer input, and we know in this case the exponent 3 is based on the lower bound such that $n-(d(m)-1)$ in our case with ($m=3$ and $n=5$) we get ($5-(d(3)-1)=5-(3-1)=3$), puting all of this generalization together we can get the following express

$$P(A_t)=1-(1-\frac{1}{10^{m\lfloor 1+\log(m)\rfloor}})^{n-(m\lfloor 1+\log(m)\rfloor-1)}$$

In our original question we already knew the probability we want, instead we are trying to solve for the value of digits $n$ to do this we just need to alegrabically solve the equation, I will however replace $P(A_t)$ with the single value $p$ for ease.

$$\begin{aligned} &1-p = (1-\frac{1}{10^{m\lfloor 1+\log(m)\rfloor}})^{n-(m\lfloor 1+\log(m)\rfloor-1)})\\ &\ln(1-p) = (n-(m\lfloor 1+\log(m)\rfloor-1))\ln(1-\frac{1}{10^{m\lfloor 1+\log(m)\rfloor}}))\\ &n\ln(1-\frac{1}{10^{m\lfloor 1+\log(m)\rfloor}})=\ln(1-p)+(m\lfloor 1+\log(m)\rfloor-1)\ln(1-\frac{1}{10^{m\lfloor 1+\log(m)\rfloor}})\\ &n=\frac{\ln(1-p)+(m\lfloor 1+\log(m)\rfloor-1)\ln(1-\frac{1}{10^{m\lfloor 1+\log(m)\rfloor}})}{\ln(1-\frac{1}{10^{m\lfloor 1+\log(m)\rfloor}})} \end{aligned} $$

And just like that I think we have solve our original problem, as you can see below in the table depending on the number and the percent we get some merging pattern. Below you can see the values for $p$ being 0.1, 0.5, and 0.9. for the numbers 1-9.

Note: Graph and table are from Desmos, the graph has the y value scaled logirthimic. The values of $m$ range between 1-9 and the probability showen are 0.1, 0.5 and 0.9 respectively.

We can see from the table something very special, it apears that as the value of $m$ increase the value of $n$ approachs a specific number for each probability $p$, in the case of ($p=0.1$) we approach 1.053605 for 0.5 6.931472, and for 0.9 2.302585. This really cuaght my attention doing some digging I found this. The values that n approach is a scaled power of the negative natural log of the complement, that is $-\ln(1-p)$. For the $p$ value 0.5 we get te number 0.69314718056... this what the value of n approaches, as m increase. with that I would suspect that the value of $n$ with $m=10$ and $p=0.5$ will be approximately $-\ln(1-0.5)\times10^{20}$ in this case I'm using 20 because that would be the length of the sequence for 10 to repeat 10 times.

Another interesting fact is that these are also a scale values for which the exponential distribution with a mean of 10 will equal. For example to get the probability of 0.1 integrating the PDF from zero to a certain constant $\int_0^C0.1e^{-0.1x}dx=0.1$ the value C will also come out to be $-10\ln(1-0.1)$.

Now, the good question is why is this the case? well the relation ship between the exponential PDF and the natural log is pretty self evident why does it show up for our equation? Well, I don't have a good answer at the moment but I would suspect that is has to do with the exponential relationship that probability has, and with the fact that we count with a base 10 format. If someone could come up with a good way to mathimatically show the relationship between the two that would be amazing.

Next Problem

Now I think the normal person would stop here, but I was not sastified with my big number function it did not produce a big enough number with a small enough output. So the next question came to mind

Given a random string of numbers, and the inputs m and p, where m is a positive integer and p is a probability between 0 and 1; what is the number of digits n so that the first m digits of the random string repeat in sequence an equal amount to their value?

This question is a bit more wordy, so I'll break it down some more. We are again given a random string (let's call it $r$), $m$ which is a positive integer, and $p$ which is a probability betwen 0 and 1. we are told to index the first $m$ digits of $r$ so our new number (let's call it $l$) is equal to the first $r[1:m]$ numbers. For example let's say we are given ($m=3$) then $l$ could be any 3 digit number from 000-999. As you can imagine the answer to this problem will result in a much larger number for smaller numbers.

Brainstorming Again

Well to tell you'll all the truth I am stuck. But I want to share the stuff I did what I found and maybe yall can help me figure this all out. So the first things that I thought was, that the output $n$ would be the average of the $n$ values from our previous equation. for example let's say we have the value $m=1$ and $p=0.1$ then I would assume that the average which we can express as:

$$\bar{n} = \sum_{j=1}^9\frac{n(j,0.1)}{10}$$

where $n(j,0.1)$ is the previous equation we found. the assumption here is that the average percentage would be 0.1 when we use the first equation we found. solving for $\bar{n}$ I get 11706728.2. the average percentage for the any given value for 1 digit would be:

$$\bar{p}=\sum_{j=1}^9\frac{p(j,\bar{n})}{10}$$

where $\bar{n}$ is the value we found, and $p(j,\bar{n})$ is the first found equation. however if you do solve this you would get around 0.681194646277. which as you could tell is much higher then the 0.1 we were reaching for. If you mess around with the actual value of $\bar{n}$ you'll find that a value of 17 or 18 will get you around 0.1, which is what we're looking for.

I also think there is a chance I'm not defining the problem well enough, another way to think of an answer would be such that the for the same values of $m=1$ and $p=0.1$ maybe we want the accumlative probability of the events to be 0.1? Let's try to write down stuff that we do know; first we know that if we get the value 0 then the event sucessed, since that is one of the 10 posabilities we know that at minimum $n=m$ we always have at least a 1/10 chance of passing.

This gives me an idea let's try to find the probability $p$ for a given $n$ and $m$. As we just saw for $n=1$ and $m=1$, we sucessed at least 1/10 of the time, another 1/10 of the time $l$ will be 1 this means that the probability that 1 will occur 1 time is also 1/10, this means that 1/10 of the time we sucess 1/10 of the time, for the other 8/10 time the probability is zero since $n$ is below the lower bound. so given $n=1$ and $m=1$ we know we have a $\frac{1}{10}\cdot1 +\frac{1}{10}\cdot\frac{1}{10} + 0\cdot\frac{8}{10}$. which is a total of 0.11 percent chance.

If we continue look for example let's know look at $n=2$ with the other stuff the same, the probability when $l=2$ is no longer zero but rather $\frac{1}{10^2}=0.01$, and the probability when $l=1$ will also increase by a bit so the total probability become:

$$P(B) = 1\frac{1}{10}+0.19\frac{1}{10}+0.01\frac{1}{10} = \boxed{0.12}$$

still it's not quite easy to see a pattern at least not one we can easily solve for. but at least we know one thing, that for any value $n\ge 1$ we will at least have a sucess rate of 0.11. This means that asking for any percent lower then at least $\frac{1}{10^m}$ is trivial because the possibility that you get all zero as the first indices of $r$ is equal to $\frac{1}{10^m}$.

Okay. i think that's all I have for now, I will try to come back to this once I get an idea for a good approach, if you wish to contact me about a possible answer just comment on the website profile activity thing, I'll at least check for comments on the weekend. If I was to keep going first I would continue solve for higher values of $n$ like up to at least $10^m-1$ because that would be the point where every possible first index of $r$ would have a non-zero probability. Maybe you'll find a pattern that can be exploited.

Anyways have fun mathing away and good luck! =]. hopefully I finish this at some point.