Decision making under uncertainty

“We sail within a vast sphere, ever drifting in uncertainty, driven from end to end.”

— Blaise Pascal

People make decisions every day, and although most of our decisions seem to follow from logical arguments, the opposite is often true. Most of the decisions we make are driven by biases and heuristics that our minds have developed to respond faster to the environment around them.

These tendencies were first described and studied by Amos Tversky and Daniel Kahneman, and they are explained in Kahneman’s excellent book, Thinking, Fast and Slow. I highly recommend the book since it is an excellent guide to where our minds tend to be error-prone and mislead us. In brief, the book describes our minds as operating through two separate systems, which Kahneman calls System 1 and System 2.

System 1 is the default system of the mind, the one used for everyday activities. These activities include walking, breathing, detecting danger and avoiding it, eating, doing routine activities, and all activities that do not involve heavy cognitive effort. System 1 is emotional, automatic, fast and uses no extra energy from our energy hungry brains.

System 2 is the exact opposite of System 1. It is rational, quite slow, and requires a large amount of energy and effort on our part. System 2 allows you to analyse every situation in detail, however this is not always ideal. If a car came speeding towards you, System 2 would analyse the colour of the car, the brand and maybe the speed and approach vector, which is a nice thought exercise, but you might get hit by the car before System 2 acted. System 1 will not bother with all those details; it will just register “car coming towards me at great speed, I need to move out of the way”.

The problem when making decisions is that there are situations, shaped by the aforementioned biases and heuristics, in which System 1 makes the decision while our minds believe it was System 2. In other words, we might make decisions for subjective reasons and have a high degree of confidence in them, believing they are objective.

In this article I would like to show some methods that can reduce these types of biases and help in engaging System 2 in the decision making process.

Distribution Types

I would like to start with some examples of how situations can seem similar on the surface and yet have massively different outcomes. This is one of the most common situations in which System 1 makes the decision and we might not notice.

Examples make this much easier to see. Let us say we are at a music festival (I know, I know…) together with 9,998 other people. We are lucky enough to have information about this crowd and we know that the average age is 24.5 years. We also know that the average net worth for the crowd is $65K. With this information at our disposal we can play around with the impact of one person.

Firstly, I would like you to think of the oldest person you know of. They do not have to be alive any more. What would happen if we were to add their age to the average? Since I cannot read your mind, I will jump in with the oldest person that comes to my mind, Methuselah, at a respectable 969 years old. So, let’s see the impact of Methuselah on the average age.

\[m_{age} = \frac{10,000 \times 24.5 + 969}{10,001} = 24.5944406\]

Not much of an impact, is it?

Secondly, let us try the same for net worth. Think of the richest person you know of and try to figure out what their impact on the crowd would be. How much will the average change?

I’m thinking of Elon Musk although at the time of writing this article he is worth a measly $169B as opposed to Jeff Bezos’ $193B. I’m a fan of Musk so let’s check with his net worth.

\[m_{net\_worth} = \frac{10,000 \times 65,000 + 169,000,000,000}{10,001} = \$16,963,304\]

Quite a difference, right? I’d say so.
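
The two calculations above are easy to reproduce. The snippet below is a minimal sketch in R; `mean_with_newcomer` is a hypothetical helper introduced only for illustration, and the figures are the ones used in the text.

```r
# Hypothetical helper: the new average after one extra person joins a crowd
mean_with_newcomer <- function(n, old_mean, new_value) {
  (n * old_mean + new_value) / (n + 1)
}

# Methuselah (969 years) joins a crowd of 10,000 with an average age of 24.5
new_age <- mean_with_newcomer(10000, 24.5, 969)

# Elon Musk ($169B) joins the same crowd, where the average is $65K
new_worth <- mean_with_newcomer(10000, 65000, 169e9)

# Relative change of each average
c(age = new_age / 24.5 - 1, net_worth = new_worth / 65000 - 1)
```

The average age moves by less than half a percent, while the financial average grows roughly 260-fold.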

This difference is because the two characteristics of the music enthusiasts population follow two different distribution types. The first example follows a normal distribution while the second one follows a power law distribution.

Some basic characteristics of population distribution will be essential in our analysis, so I will take a few moments to explain the two types of distribution I mentioned earlier, although a distribution can have many other variations.

Normal Distribution

As mentioned above the age of the music enthusiasts is distributed normally. A normal distribution is a phenomenon observed by Carl Gauss and it can help us a lot in our analysis due to its properties.

A normal distribution means that most individuals in a population are close to the average, and the further we go from the average, the fewer individuals we find. If we take height as an example, most people in a population will be around average height, while those who are extremely tall or extremely short are a rarity. The more extreme the height (taller or shorter), the less likely we are to find someone with that height.

If a population has an average height of 1.75m (5.74 ft), you will have an easier time finding someone who is 1.9m (6.2 ft) than someone who is 2.13m (7 ft). Also, the chance of finding someone who is 7km (4.3 mi) tall is virtually zero.
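
To put rough numbers on this, the sketch below uses R's `pnorm()` to compute the share of a normally distributed population above a given height. The standard deviation of 7 cm is an assumption chosen only for illustration; it does not come from any dataset.

```r
avg_height <- 1.75  # average height from the example above, in metres
sd_height  <- 0.07  # ASSUMPTION: illustrative standard deviation, in metres

# Share of the population taller than a given height
share_taller <- function(h) {
  pnorm(h, mean = avg_height, sd = sd_height, lower.tail = FALSE)
}

share_taller(1.90)  # a little over 1 in 100 under these assumptions
share_taller(2.13)  # on the order of 1 in tens of millions
share_taller(7000)  # 7 km: indistinguishable from zero
```

The exact values depend entirely on the assumed spread, but the pattern does not: each step away from the average makes a match dramatically rarer.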

This is a good thing because it allows us to assess a population quite easily using a sample. Even if our estimates based on the sample are off, as long as the sample was not biased we can be fairly certain that we will not be off by much.

Data that follow a normal distribution are usually measurements of natural phenomena. They include, but are not limited to:

  • height
  • weight
  • age
  • reaction times
  • average time spent playing video games

Power Law Distribution

The other effect we have observed is called a power law distribution or, for those who work in the financial sector, a Pareto distribution. I think it is best known as the 80-20 rule, most commonly remembered as: “20% of the people control 80% of the assets.”

This type of distribution can masquerade as a normal distribution, but it is not one. It has a lot of dwarves and a few giants, and that is why you cannot make estimates using the usual normal distribution indicators. If you take a sample of dwarves you cannot estimate the impact of a giant.
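
The 80-20 rule can be made concrete with a Pareto distribution. As a sketch: for a Pareto distribution with shape parameter alpha greater than 1, the share of the total held by the richest fraction p of the population is p^(1 - 1/alpha). The function name below is hypothetical, and the shape value is simply the one that reproduces the 80-20 split, not something estimated from data.

```r
# Share of the total held by the top fraction `p` of a Pareto population
# with shape parameter `alpha` (only defined for alpha > 1)
top_share <- function(p, alpha) {
  stopifnot(alpha > 1)
  p^(1 - 1 / alpha)
}

# Shape parameter that corresponds to the 80-20 rule
alpha_8020 <- log(5) / log(4)  # about 1.16

top_share(0.20, alpha_8020)  # 0.8: the top 20% hold 80%
top_share(0.01, alpha_8020)  # the top 1% alone hold roughly half
```

This is what "a few giants" means in practice: under this shape, a sample that happens to miss the top 1% misses about half of everything.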

Ignoring this fact is why we have a lot of people calling themselves actors, writers or singers rather than what they actually do for a living, most likely bartending, waiting tables or some other job unrelated to their self-reported career.

Distributions in Decision Making

Here is where it is important to know a bit about distributions.

If I were to propose a bet in which you are allowed to sample 100 people from the festival mentioned earlier and then estimate something about the whole crowd based on that sample, which of these bets would you take?

  1. Education level
  2. Number of cars owned
  3. Amount of beer drunk weekly

I would go for education level since it is highly unlikely that someone has spent 50 years in education. And even if someone has, the impact on the average of 10,000 people will be small. The same goes for the amount of beer drunk weekly. If there is someone who drinks thousands of beers weekly you might be in trouble, but not as much as they are. I think you would be safe with that bet.

But what about the number of cars? Well, here it can get a bit tricky. Most people have one or two cars. But what happens if the Sultan of Brunei, with a collection of 7,000 cars, is in the crowd? Considering the low age of the crowd, it is possible that he owns more cars than the rest of the crowd put together. This would not be a problem if you could include him in the sample, however a sample of 100 people out of 10,000 gives you only a 1% chance of including him. This means that 99 times out of 100 your sample will underestimate the average number of cars owned by the crowd.
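
The 1% figure and its effect on the estimate can be checked with a few lines of R. This is a sketch under stated assumptions: exactly one collector with 7,000 cars is in the crowd, and everyone else owns 1.5 cars on average, a number made up purely for illustration.

```r
crowd_size  <- 10000
sample_size <- 100

# Chance that a simple random sample contains the one collector:
# 1 minus the hypergeometric probability of drawing him zero times
p_included <- 1 - dhyper(0, m = 1, n = crowd_size - 1, k = sample_size)

# ASSUMPTION: the other 9,999 people own 1.5 cars each on average
cars_others    <- 1.5
cars_collector <- 7000

# True crowd average versus what a sample without the collector suggests
true_mean    <- ((crowd_size - 1) * cars_others + cars_collector) / crowd_size
typical_mean <- cars_others

c(p_included = p_included, true_mean = true_mean, typical_mean = typical_mean)
```

Under these assumptions the true average is about 2.2 cars, yet 99 samples out of 100 would point to about 1.5.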

This is why it is important to figure out what kind of distribution the population you are making a decision about follows, and to adjust your decision making accordingly. In the following section I will describe a problem that I find interesting to solve, namely what career would be best if I were to choose again, and see how we can take advantage of these distributions.

The problem

In the next section I would like to show in brief what my thought process would be if I were to choose a career once more, assuming that I have the knowledge I have now.

First of all, I need to know what I expect from a career: money, fame, free time, all of those? Multiple variables could be analysed here, however for the sake of simplicity I will keep it short and analyse just one. So, in this analysis the most important factor in choosing a career is how much income the career can generate during a person’s working life.

Now, what careers are on the short list?

  1. The first option would be doctor. This is a good option in my opinion as it is supposed to be well paid, and one can find work almost everywhere. Wherever there are people, there are sick people and a doctor will always be in demand. A medical career offers a lot of variety in both specialities and type of practice which is always good to have.
  2. The second career option would be one in IT. This is also rumoured to be well paid and besides that it can give one a lot of puzzles with which to play. I like challenges and spending time figuring out puzzles so this seems a good choice for me. As a bonus, any mistakes made as a programmer can be easily patched and fixed, and most mistakes made by programmers do not endanger anyone’s life, as opposed to a doctor’s mistake.
  3. The third option is highly appealing because, on the surface, it can generate a lot of income with minimal effort. Whether that holds up remains to be seen when we analyse the data. The career is poker player. I like the idea of creating my own schedule and working when I want, from where I want.

The method

Everything in nature is subject to variation, so we will assume that income varies based on different factors. Therefore we cannot generate a single number and say “There! This is the expected income if you choose this career!”.

As an example, if I were to choose to study medicine, I might have an idea of what kind of doctor I want to be, however that does not guarantee that I will end up in that particular field. I might not like the field once I get into the nitty-gritty of it, or I might like another field more once I find out more about it. Since I cannot control that, I will simulate multiple outcomes and see which career is best overall, taking into account its highs and lows.

E.g.: If I were to choose between being a janitor who just won the lottery and being a doctor, I would choose to be the janitor. However, if I were to choose between the two careers and live 10,000 lives, I would choose the doctor because in the other 9,999 lives it is the better outcome.

How do we do this?

We will gather data on each of the three career paths we have chosen and we will analyse it to try and figure out which are the main characteristics of the three populations, doctors, IT workers, and poker players. We will try to find out what is the average salary, how much does it usually vary from there, and if there is a tendency of most salaries to be over or under that average. Once we have all this information, we will extract a sample of that data of 10,000 randomly chosen salaries.

The salary information will be from the Romanian job market since I am based in Romania.

Coding examples

Since we are analysing data we will need a tool that will allow us to do so. I will use R since it is a versatile tool that will allow us to create distributions and visualise them quite easily. The code can be completely ignored if you do not want to reproduce or analyse it, you can just focus on the graphs and the results. If you want to learn more about R I would recommend R for Data Science, a brilliant book that is the perfect starting place.

Let’s start coding. At first I would like to set up the variables that I will use in this article. Since we will use these values multiple times in the article, I will store them here and if we need to adjust them at a later date, I can just change the values once, in here.

The first values we want to encode are the average salaries. At the moment you can ignore the values, they will make sense when we come to the analysis itself.

# Medical Salaries
dr_sal_mean <- 21100 # Average gross MONTHLY salary for doctors, in RON

# IT Salaries (gross YEARLY salaries, in RON)
it_sal_med_high <- 225000 # Median salary for managers and directors in IT
it_sal_med_low  <- 92809  # Median salary for everyone else working in IT

The salaries are stored here, rather than later when I analyse the careers themselves, because it is good practice to keep variables in one place. The same goes for custom functions, so I will add some here as well. These functions just store lines of code that we would otherwise need to repeat over and over again. Keeping such lines in one function avoids duplicated code, and if I want to change something at a later date, I only need to adjust it in one place.

# Packages used throughout: tidyverse provides %>%, tibble() and ggplot2;
# scales provides label_dollar()
library(tidyverse)
library(scales)

# Graphical function to avoid writing the same lines over and over
histogram <- function(.df, .var = Income){
  
  .df %>% 
    ggplot(aes({{.var}})) +
    geom_histogram(fill = "#2484cc", colour = "white") +
    labs(x = "Income", y = "") +
    scale_x_continuous(labels = label_dollar(prefix = "", suffix = " RON"))
}

# Just a simple function to format the Romanian currency so I can use stored variables.
ron_format <- function(.x){
  label_dollar(prefix = "", suffix = " RON")(.x)
}

With this out of the way, we have everything we need set up. The only thing that remains is to start analysing each career and then comparing them.

The careers

The Doctor (of the medicine not Who variety)

Being a doctor is a well-respected career in all societies and it tends to be well paid. Our target job market, Romania, is no exception.

Even though doctors are well paid in Romania, their income varies depending on different factors like the speciality, experience, degree level and many more. My purpose in the next section is to figure out:

  1. What is the average salary for a doctor in Romania
  2. What is the variation from the average
  3. What is the distribution of those salaries

Once I have this information, I can apply all the necessary transformations to get the lifetime earnings of a doctor expressed in euro (€).

Before I can do all that I will need a data source, and finding one can get frustrating. Although there are a lot of websites that offer this information, they are not always reliable, so I prefer to use official databases (e.g. Eurostat), scholarly articles, or websites that are transparent about their sources.

For this particular situation, I have found salaryexplorer.com. I believe they are reliable in their claims, and their sources can be seen at the bottom of the page. The displayed salaries are in the local Romanian currency, RON, before tax. Applying the tax and converting to € will be easy once I have settled the larger questions mentioned earlier: average salary, variation, and distribution.

The first part is the easiest to figure out. The average salary is mentioned in the beginning of the article, 21,100 RON.

Fortunately, thanks to the website the second and third questions are quite easily answered as well. Most salaries vary between 7,750 RON and 35,600 RON. Now, all I need to do is figure out the distribution of those salaries.

Given that the salaries lie between 7,750 RON and 35,600 RON, and after taking a look at the list of salaries for specific jobs, I can conclude that the salaries do not follow a power law. I will therefore treat them as normally distributed, or close enough to normal as to make little difference. Since I am not dealing with a power law, I can be pretty confident that even if I am off, I will not be off by much.

With this said, let’s create our sample and take a look at it.

dr_sal_dist <- rnorm(10000, dr_sal_mean, sd = 5000)

tibble(Income = dr_sal_dist) %>% # transforms the list in a table for the graph
    histogram() +
    labs(title = "Monthly Income Distribution",
         subtitle = "Medical Professionals")

The graphic above is a simple representation of how many people earn a specific amount on a monthly basis. As it is expected from a normal, or near normal distribution, most people earn close to the average while a few have extremely high earnings or extremely low earnings.
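
The standard deviation of 5,000 RON used in the simulation is a choice, not a figure quoted from the source. One rough way to sanity-check it is to ask what share of a normal distribution with that spread falls inside the 7,750 to 35,600 RON range quoted earlier. The sketch below does that with `pnorm()`; the variable names are illustrative.

```r
sal_mean <- 21100  # average monthly salary in RON, as quoted above
sal_sd   <- 5000   # standard deviation used in the simulation

# Share of a normal distribution between the quoted low and high salaries
share_in_range <- pnorm(35600, sal_mean, sal_sd) - pnorm(7750, sal_mean, sal_sd)
share_in_range  # above 0.99 under these assumptions
```

Almost all simulated salaries land inside the quoted range, so the chosen spread is at least not at odds with it. A smaller share would suggest the standard deviation is too large.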

Now that I have the distribution simulated, I can proceed with transforming the data so that I have the salaries in € and can calculate the earnings for a lifetime. For that I will need to:

  1. Deduct the taxes from these salaries
  2. Determine how many salaries a doctor receives in a lifetime
  3. Convert everything to €

The Romanian government imposes taxes on the employee as follows:

  • Social Security: 25%
  • Health Insurance: 10%
  • Income tax: 10%, applied to the remaining 65% of the gross salary

This will be translated as follows:

\[ m_{salary(RON)} = (21,100 - \frac{25}{100} \times 21,100 - \frac{10}{100} \times 21,100) \times \frac{90}{100} = 13,715 \times \frac{90}{100} = 12,343.50\]
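
The same deduction can be written as a small R function, which makes it easy to try other gross salaries. This is a sketch: `net_salary` is a hypothetical helper, and the rates are the ones listed above.

```r
# Hypothetical helper: net monthly salary from a gross monthly salary
net_salary <- function(gross,
                       social_security = 0.25,
                       health_insurance = 0.10,
                       income_tax = 0.10) {
  # Contributions are taken from the gross salary first
  after_contributions <- gross * (1 - social_security - health_insurance)
  # Income tax applies to what remains
  after_contributions * (1 - income_tax)
}

net_salary(21100)  # 12343.5
```

Because the function is vectorised, it can also be applied to a whole vector of simulated salaries at once.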

This gives me the net salary received by a doctor in Romania. All that is left to do is calculate how many such salaries one doctor receives. In Romania medical school lasts for 7 years and afterwards one needs to spend 3 to 5 years in residency. Since residency is paid I will include it in my analysis and I will consider that a doctor starts working when they start their residency. Since most people finish high school at 19 and med school is 7 years, I will consider that one starts working at 26.

Since I do not plan to work more than I need to, I plan on retiring at the first chance I get. In Romania the retirement age is 65 so I will plan for that. From 26 to 65 there are 39 years of work; multiplied by 12 months, I can expect to work for around 468 months.

And for the conversion to €, that is quite easy. According to the National Bank of Romania, the conversion rate is 1 RON for 0.20 € so all I need to do is divide the number by 5. So, let’s code everything:

# First, I will need to deduct the taxes so I can compare just the NET salaries
dr_sal_final <- (dr_sal_dist * 0.65) * 0.9

# Now I want to transform them in LIFELONG earnings
dr_sal_final <- dr_sal_final * 468

# And finally, the conversion to €
dr_sal_final <- dr_sal_final / 5  
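
As a quick sanity check on the three steps above, the same chain can be applied to the average salary alone. This is only a sketch of the arithmetic with illustrative variable names; the simulated values will scatter around this figure.

```r
avg_gross_ron <- 21100           # average gross monthly salary in RON
months_worked <- (65 - 26) * 12  # 468 months between ages 26 and 65
ron_per_eur   <- 5               # 1 RON = 0.20 EUR, as used above

# Net of contributions (65% left) and income tax (90% of that left)
avg_lifetime_eur <- avg_gross_ron * 0.65 * 0.9 * months_worked / ron_per_eur
avg_lifetime_eur  # about 1.16 million EUR
```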

One more thing I would like to add: what is the potential success rate of me becoming a doctor? This is a bit tricky to find out for certain but I will try. Before that, I think it is a good idea to write down my thought process, and then I can crunch the numbers.

Since applying for medical school requires quite a brutal exam, I will assume that no one applies for it without being serious about it and without having the discipline and study to pass the exam. This information is relevant because it allows me to safely assume that a student once admitted is highly likely to become a doctor.

This happens through two mechanisms. The first is that the extra preparation for the exam allows potential students to see the effort required, and if they are not willing to put in the effort they can change their mind before the admission exam. The second mechanism, which keeps students from quitting, is the sunk cost fallacy. For this reason, I will only take into account the admittance rate, which can be taken from the university’s website.

In this case I will use the 2020 admissions for the University of Medicine and Pharmacy in Cluj-Napoca. The admittance rate is simple, how many people were admitted over how many applied. The admitted numbers were as follows:

  • General Medicine (Scholarship) - 318 Admitted
  • General Medicine (No Scholarship) - 60 Admitted
  • Dental Medicine (Scholarship) - 81 Admitted
  • Dental Medicine (No Scholarship) - 17 Admitted
  • Rejected - 591

\[admit\_rate = \frac{318 + 60 + 81 + 17}{318 + 60 + 81 + 17 + 591} = \frac{476}{1067} \approx 0.4461 \]
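
The same rate can be computed in R from the admitted counts in the list above, which avoids retyping the sums by hand. A minimal sketch, with illustrative variable names:

```r
# Admitted candidates per track, as listed above
admitted <- c(general_scholarship    = 318,
              general_no_scholarship = 60,
              dental_scholarship     = 81,
              dental_no_scholarship  = 17)
rejected <- 591

# Admitted candidates over all candidates
admit_rate <- sum(admitted) / (sum(admitted) + rejected)
round(admit_rate, 2)  # 0.45
```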

There, my chances are about 45%. Now I will code this chance by generating a list of 10,000 elements which are either 0 or 1, each with a 45% chance of being 1, so roughly 4,500 of them are 1. Then I will simply multiply the two lists. If a salary is multiplied by a 0, it signifies that I did not make it as a doctor.
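
To see what the multiplication does before applying it to the full sample, here is a toy sketch with three made-up lives. The numbers are invented purely for illustration.

```r
# Three hypothetical lifetime earnings (in thousands of EUR)
toy_earnings <- c(100, 120, 90)

# Whether I was admitted in each of those lives: 1 = admitted, 0 = rejected
toy_admitted <- c(1, 0, 1)

# Element-wise multiplication zeroes out the lives where I was rejected
toy_earnings * toy_admitted  # 100 0 90
```

The zeros stay in the data on purpose: they pull the average down to reflect the risk of never getting in.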

# The list that will show me if I have made it or not
admittance_rate <- rbinom(n = 10000, size = 1, prob = 0.45)

# Now I will add the chance to the previous list
dr_sal_final <- dr_sal_final * admittance_rate

# Just to be sure, I would like to see the rate
print(sum(admittance_rate) / 10000)
#> [1] 0.4472

I will take a look at the numbers in the analysis part, this is just the preparation of the data. Now, let’s see if we can estimate how much does an IT professional earn.

The IT Crowd

Fortunately for me, Romania has one of the most active IT sectors in this part of Europe and it is expanding continuously. This is a good thing for someone looking to start on this career path because it offers a lot of opportunities. All I need now is to check if those opportunities are worth it or not. For this I have found a beautiful analysis on 1,267 salaries for IT professionals that can be seen here.

The sample used in the analysis is large enough to give me a good estimate of how the salaries are distributed in the IT sector. The data is a bit skewed and slightly more than half the people have an income level slightly below the average. This effect is most likely caused by the income of the managers and directors, as they have significantly higher incomes. The higher incomes raise the average salary value, however this is not a problem, I will just need to estimate the share of managers and directors and account for that.

As with the medical salaries, I would first like to figure out:

  1. What is the average salary for an IT professional
  2. What is the variation from the average
  3. What is the distribution of those salaries

The first part, as in the previous section, is quite easy: 11,200 RON.

Looking at the average income across all levels, and given that managers and directors earn significantly more than everyone else, I can approximate that around 15% of IT professionals are managers and directors. In short, I can rephrase this as the following statement: assuming that my skill level for the IT industry is average, by sheer luck I will end up in a management position 15% of the time, if I were to live 10,000 lives.

# Distribution of IT salaries for software engineers
it_sal_low <- rnorm(8500, it_sal_med_low, 25000)

# Distribution of IT salaries for IT managers and Directors
it_sal_high <- rnorm(1500, it_sal_med_high, 50000)

# The sample of my income across all lives will be the two distributions put together
it_sal_dist <- c(it_sal_low, it_sal_high)

# The average of my salary distribution should be similar to that on the website
mean(it_sal_dist)
#> [1] 111765.1
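
The sample mean can also be compared with the exact mean of the mixture, which is just the weighted average of the two group centres. A sketch, with the 85/15 split and the two central values taken from the code above and illustrative variable names:

```r
share_managers <- 0.15
center_low  <- 92809   # central salary for everyone else in IT (RON/year)
center_high <- 225000  # central salary for managers and directors (RON/year)

# Expected value of a two-component mixture
mixture_mean <- (1 - share_managers) * center_low + share_managers * center_high
mixture_mean  # about 112,638
```

The simulated mean differs from this value only through sampling noise, so a large gap between the two would point to a coding mistake.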

From what I can see the distribution I have created has the same average salary as the population on the website so I can safely assume that the distribution I have created more or less resembles the population I want to recreate. The distribution looks like this:

tibble(Income = it_sal_dist) %>% 
  histogram() +
  labs(title = "Yearly Income Distribution",
       subtitle = "IT Workers")

This is not a normal distribution, which was to be expected, however it is not a power law distribution either. It could become a power law if I took into account the possibility of developing an app and making it big.

It is a bit complicated to get data on the success rate of an app. An interesting and comprehensive summary can be found in this article. I can see that I would need €25,000 as an initial investment, I would need to make a full-time job out of it, and afterwards I would need to invest in marketing and spend a lot of time adjusting ASO (app store optimisation). For me this would have been a potential career if it had been possible to spend some weekends creating the app, upload it to an app store and be done with it.

From what I’ve seen there is virtually no chance of that and I do not plan on spending all the effort needed for a successful app so I will not include this in the analysis.

With this in mind, I will assume that the salaries are as shown in the graph above. The mentioned salaries are annual salaries in RON before tax. Now I will just need to adjust them to something more useful.

First of all, I will need to deduct the taxes. At the time of writing, IT professionals are exempt from income tax, so all I would need to pay would be:

  • Social Security: 25%
  • Health Insurance: 10%

Secondly, I need to convert everything into €. As before, I will use the conversion rate from above.

# Net salary
it_sal_dist_net <- it_sal_dist - (it_sal_dist * 0.25) - (it_sal_dist * 0.1)

# The salary converted to €
it_sal_dist_net_eur <- it_sal_dist_net / 5

Now all that is left is to see how long I would expect to work in the industry.

Most people working in the IT industry start working while studying at university. I will not count these jobs because most of them are really low paying or they are unpaid internships. This situation allows people to get a proper job right after finishing university or even a bit before graduation. I will keep my assumption that I will finish high school at 19 so I have a starting point.

If you are studying software engineering in Romania, you can study for either three or four years, depending on whether the university follows the Bologna system or not. This does not really matter because it is not the only way to get to work in IT. There are a lot of people who go through a professional conversion, and they usually do this at a later age. So, if I account for that as well, I think it is pretty safe to assume that I would start work around 25.

Considering that my goal is to retire at 65, I have 40 years of work ahead of me, so all I need to do is multiply everything by 40.

it_sal_dist_net_eur <- it_sal_dist_net_eur * 40

As in the first part, I will check the data once everything is prepared. Before that, I would like to see if my idea of making a living as a poker player is feasible.

Poker Star

This was one of the most exhausting pieces of research I’ve done so far. I encountered some huge problems while trying to gather data for this career path. This was to be expected, as I imagine that casinos are not willing to let everyone know what the real chances of winning are by revealing the numbers. Still, I had hoped to find some kind of information, but no luck.

So I decided to change my tune and see if an online poker career would be better suited. Going online has one additional advantage, although it is of little importance here: it requires less money to start with.

With that being said, in the next section I would like to answer the following questions:

  • What is the probability of making it as a professional online poker player?
  • How much should I expect to earn should I make it as a player?
  • How is that money distributed? Is it more likely to win big constantly or is it highly unlikely?

For online poker statistics, as for real-life poker, there are no public databases, so all the information I can access is what the online poker sites choose to display. I could use it to make some assumptions, however it can get overly complicated really fast. Most sites use token currencies, not real money, so it is hard to account for the money. Also, most of them show the amount earned by a player over their whole time as a player, so it is difficult to tell whether someone has low earnings because they have just created the account or because they are a poor player.

With that said, I have found the following blog that offers some details about the world of online poker and some data gathered by its author. I know this is not the most reliable dataset, however it is all I have to go on. I would also like to mention that the data presented by online poker sites is just as unreliable. While the data I am using might be flawed because of unconscious bias, the data presented by the websites is most likely biased to make you play. I have also asked people who are involved with online poker to take a look and they said that nothing suspicious pops up, so I will choose to trust the data.

From what I could read in the article, online poker is more grind-based and less exposed to chance than its face-to-face counterpart. By that I mean that a player will focus on winning blinds, not big hands, and over time this will prove successful. The tables are also divided by the size of the blinds as follows:

  • NL2 (also known as 2NL) = ¢1 / ¢2 blinds
  • NL5 (also known as 5NL) = ¢2 / ¢5 blinds
  • NL10 (also known as 10NL) = ¢5 / ¢10 blinds
  • NL25 (also known as 25NL) = ¢10 / ¢25 blinds
  • NL50 (also known as 50NL) = ¢25 / ¢50 blinds
  • NL100 (also known as 100NL) = ¢50 / $1 blinds

Considering that I would have to focus on winning big blinds and not hands, my winnings will be influenced by the value of the big blind. The formula can be seen below, where the winning chance is expressed as the number of big blinds won per 100 hands played:

\[ (winning\_chance \times blind\_value) \times \frac{played\_hands}{100} \]

Usually a professional poker player will play around 100,000 hands a month. I think that would be a fair number for my approximations. I am going for this career out of laziness, the opportunity of easy money, and the possibility to make my own hours, so I will not overdo it.

The next thing I will need is the win rate for each type of table. I will use the win rates from this blog post and, since I have already decided that I will play a fixed number of hands each month, I will use the higher win rates in the tables. Now I will compute the monthly winnings for each table, expressed directly in dollars.

\[ Avg\_NL2 = (10 \times \frac{2}{100}) \times \frac{100000}{100} = \$200\]

\[ Avg\_NL5 = (6 \times \frac{5}{100}) \times \frac{100000}{100} = \$300\]

\[ Avg\_NL10 = (4 \times \frac{10}{100}) \times \frac{100000}{100} = \$400\]

\[ Avg\_NL25 = (3 \times \frac{25}{100}) \times \frac{100000}{100} = \$750\]

\[ Avg\_NL50 = (2 \times \frac{50}{100}) \times \frac{100000}{100} = \$1000 \]

\[ Avg\_NL100 = (2 \times 1) \times \frac{100000}{100} = \$2000 \]

These are the estimates for most professional players, however elite players have a different bracket. I am curious to see how much an elite player makes and see how we can distribute the earnings in the population.

\[ Elite\_NL2 = (30 \times \frac{2}{100}) \times \frac{100000}{100} = \$600\]

\[ Elite\_NL5 = (18 \times \frac{5}{100}) \times \frac{100000}{100} = \$900\]

\[ Elite\_NL10 = (12 \times \frac{10}{100}) \times \frac{100000}{100} = \$1200\]

\[ Elite\_NL25 = (10 \times \frac{25}{100}) \times \frac{100000}{100} = \$2500\]

\[ Elite\_NL50 = (9 \times \frac{50}{100}) \times \frac{100000}{100} = \$4500 \]

\[ Elite\_NL100 = (8 \times 1) \times \frac{100000}{100} = \$8000 \]
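
The twelve calculations above all follow the same pattern, so they fit in one small R function. This is a sketch: `monthly_winnings` is a hypothetical helper, the win rates (in big blinds per 100 hands) are read off the formulas above, and the elite NL5 rate is taken as 18, the value implied by the $900 result.

```r
# Hypothetical helper: monthly winnings in dollars
#   win_rate  = big blinds won per 100 hands
#   big_blind = size of the big blind in dollars
monthly_winnings <- function(win_rate, big_blind, hands = 100000) {
  win_rate * big_blind * hands / 100
}

tables <- data.frame(
  table      = c("NL2", "NL5", "NL10", "NL25", "NL50", "NL100"),
  big_blind  = c(0.02, 0.05, 0.10, 0.25, 0.50, 1.00),
  avg_rate   = c(10, 6, 4, 3, 2, 2),
  elite_rate = c(30, 18, 12, 10, 9, 8)
)

tables$avg_usd   <- monthly_winnings(tables$avg_rate, tables$big_blind)
tables$elite_usd <- monthly_winnings(tables$elite_rate, tables$big_blind)
tables[, c("table", "avg_usd", "elite_usd")]
```

Changing the `hands` argument shows how sensitive the monthly figure is to the volume played, which is the one input a player controls directly.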

There seems to be quite a gap between the two brackets. Here we encounter the same problem as we did with the IT crowd: elite players earn significantly more than average professional players and there are far fewer of them. Fortunately for us, this is also not a power law distribution, because a player is limited in how much he can earn on one hand just from blinds. Since he can play only so many hours in a month, it is highly unlikely to see someone earn millions in one month. The population of elite players is around 10%. I will take the same approach as I did for the IT workers when creating the distribution.

avg_player <- rnorm(9000, 1400, 300)

elt_player <- rnorm(1000, 4000, 1000)

poker_players <- c(avg_player, elt_player)
tibble(Income = poker_players) %>% 
  histogram() +
  labs(title = "Monthly Income Distribution",
       subtitle = "Professional Poker Players") +
# The salaries are in € so I will adjust them
  scale_x_continuous(labels = label_dollar(prefix = "€"))
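As a quick sanity check on the mixture, the simulated mean can be compared with the mean implied by the two components. The sketch below repeats the two `rnorm()` calls with a fixed seed, which is my own addition so that the run is repeatable; the seed value itself is arbitrary.

```r
set.seed(42)  # assumption: any fixed seed works, 42 is arbitrary

# Same two components as above: 90% average players, 10% elite players
avg_player    <- rnorm(9000, 1400, 300)
elt_player    <- rnorm(1000, 4000, 1000)
poker_players <- c(avg_player, elt_player)

# Mean monthly income implied by the 90/10 mixture
expected_mean <- 0.9 * 1400 + 0.1 * 4000
observed_mean <- mean(poker_players)

print(c(expected = expected_mean, observed = observed_mean))
```

The implied mean is 1660 a month, and the simulated mean should land within a few units of it.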

The distribution seems OK from what was described in my source. Now, for the final touch, I just need two more steps:

  1. Transform the data into lifelong earnings
  2. Add the probability of actually making it as a professional poker player, as we did for the medical career path

So, for how long should I expect to be active in this domain? This is a bit tricky because it is not a career recognised by the Romanian government so there is no pension plan for it. This means that I am not forced to retire at 65, I can do it earlier or later, however since I want to compare apples with apples I will account for working until 65 as for the previous two careers.

All I need to know now is when I can start working. Since there are no study requirements or skill assessments, and you just need to log in and play, I would be able to start any time. If I want maximum income and experience gain, I will start at 19, as soon as I finish high school. This would mean:

\[ months\_played = (65 - 19) \times 12 = 552 \]

And now for the last part, the probability of becoming a professional poker player. As mentioned above, the probability is 10% according to the blog.

# Poker Players LIFELONG earnings
poker_players <- poker_players * 552

# The list that will show me if I have made it or not
success_prob <- rbinom(n = 10000, size = 1, prob = 0.1)

# Now I will add the chance to the previous list
poker_players_final <- poker_players * success_prob

# Just to be sure, I would like to see the rate
print(sum(success_prob) / 10000)
#> [1] 0.0993

Looks good: 9.93% is close to the expected 10%.
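To put numbers on this, here is a rough back-of-the-envelope check, using only values already given in the text. It estimates how far a simulated success rate can be expected to drift from 10% across 10,000 draws, and what average lifelong income the poker path implies once the people who never make it are counted as zero. The variable names are my own.

```r
# Values taken from the text above
months_played <- (65 - 19) * 12          # 552 months
p_success     <- 0.1                     # chance of turning professional
n_lives       <- 10000                   # number of simulated lives

# Mean monthly income of the 90/10 mixture used for the simulation
mean_monthly <- 0.9 * 1400 + 0.1 * 4000

# Expected lifelong income, counting the 90% who earn nothing from poker
expected_lifetime <- p_success * mean_monthly * months_played

# Standard error of the simulated success rate (binomial proportion)
se_rate <- sqrt(p_success * (1 - p_success) / n_lives)

print(expected_lifetime)
print(se_rate)
```

The standard error is 0.003, so an observed rate of 0.0993 is well within one standard error of 0.1. The expected lifelong income comes out at about 91,600, which is the figure the simulated average should hover around.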

All I need right now is to analyse the data and see which career path is better for me.

Data Analysis

The first part of any data analysis is tidying the data up. At the moment I have three separate lists, which are not very useful, so I will merge them into one table. The easiest way to do this is to create three separate data frames and combine them into one. The code below will do this and then I can start analysing the data.

df_dr <- tibble(Career = "Doctor",
                Income = dr_sal_final)

df_it <- tibble(Career = "IT Worker",
                Income = it_sal_dist_net_eur)

df_pp <- tibble(Career = "Poker Player",
                Income = poker_players_final)

df <- rbind(df_dr, df_it, df_pp)

# it is always a good idea to eliminate from memory intermediary datasets
rm(df_dr, df_it, df_pp)

Now that the data is in one place I can proceed with the data analysis.

The first step is to check all the distributions and see if they look anything like one another.

df %>% 
  histogram() +
  facet_wrap(~ Career, ncol = 1) +
  labs(title = "Lifelong Income",
       subtitle = "ALL Careers") +
  scale_x_continuous(labels = label_dollar(prefix = "€"))

There seems to be quite a difference between the three.

The IT workers seem to have most salaries around €500,000 as lifelong income, a bit smaller than the doctors'. However, they are the only ones that have a 100% success rate in actually being able to earn from this career.

I can also see that the doctors, the ones who made it past the culling, have a fairly normally distributed salary range, which is a good thing as it means we can draw some conclusions from the data. It is very tempting to follow this career path as it can have high rewards, although the initial admission rate makes me a bit cautious.

The last group that I can see is the poker players. They do not seem to have any strong clustering, only a small cluster around €750,000, and are otherwise quite evenly distributed. It's hard to say when there are so few of them to begin with. Let's check them without the 90% that did not make it.

df %>% 
  filter(Income != 0, Career == "Poker Player") %>% 
  histogram() +
  labs(title = "Lifelong Income",
       subtitle = "Poker Players") +
  scale_x_continuous(labels = label_dollar(prefix = "€"))

It is quite clear from the data that this job is well paid: most of those who made it as professional players earn around €750K over their careers, which makes them 50% better paid than most IT professionals, and the whales have it better than the doctors, at over €3M. The problem is even more obvious than it is for the doctors: out of 10,000, only about 10% make it this far. In my IT sample I have 1567 people who earn over €750K, which is more than all the poker players that made it as professionals.

Now, before I dive too deep in this type of discussion I must remember my initial premise, which is the career path that gives me the highest income across 10,000 lives and is the least exposed to luck in getting there. With that said I would like to check the numbers.

# The data as it is now is not useful, I will rearrange it a bit to be better summarised
# This will help later on as well so I will store it in a separate data frame
df_wide <- df %>% 
  group_by(Career) %>% 
  mutate(ID = row_number()) %>% 
  ungroup() %>% 
  pivot_wider(names_from = Career, values_from = Income) %>% 
  select(-ID) 

# And a summary of the data
df_wide %>% 
  summary() %>%   
  kable(format = "html")

             Doctor     IT Worker    Poker Player
Min.              0             0               0
1st Qu.           0        409194               0
Median            0        507647               0
Mean         520582        581187           94284
3rd Qu.     1124075        632457               0
Max.        2126793       2036033         3645871

So, this is plain as day: with a lifelong average of €581,187 the IT crew wins the contest. Although the individual salaries for the doctors tend to be higher than the ones for IT, quite a few people did not make it as a doctor, and their earnings from this field, €0 in this case, pull the average down to €520,582.

It is interesting that the poker players have the highest earners, with €3,645,871; however, the average income of €94,284.36 is not very appealing to me.

The last thing I would like to test is whether there is indeed a significant difference between these careers. It does not matter as much in this context, since based on these data I would choose to follow a career in IT. Since it is the most accessible, the consequences of choosing this career are not high. However, if the situation had been less clear-cut, I might want to make sure that it matters whether I choose one career over another.

This can be done with an Analysis Of Variance (ANOVA). I’m not going to go into details about this, I will just mention that ANOVA is a way of comparing three or more groups and checking if the differences between them are significant or not.

anova_results <- aov(`IT Worker` ~ `Doctor` + `Poker Player`, data = df_wide)

summary(anova_results) 
#>                  Df    Sum Sq   Mean Sq F value Pr(>F)    
#> Doctor            1 3.797e+10 3.797e+10   0.459  0.498    
#> `Poker Player`    1 1.226e+13 1.226e+13 148.153 <2e-16 ***
#> Residuals      9997 8.274e+14 8.276e+10                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, this looks interesting. From what this shows me, there is a very significant difference between IT and poker player, which is to be expected. I can tell this from the Pr(>F) value: the smaller it is, the less plausible it is that the observed difference arose by chance alone.

The problem is that the difference between IT and doctor is not significant. This is somewhat understandable, as both careers have quite similar incomes and a somewhat similar distribution. The doctors have one bell curve while the IT crowd has two, one for the bosses and one for the rest of the people. This is why the analysis found no significant difference between the two choices. This would have mattered if I had decided, against the odds, to go for a medical career, as it would have let me choose the easier path with no significant consequences. This way I am even more satisfied with my decision.
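The call above works on the wide table, so it models the `IT Worker` column as a function of the `Doctor` and `Poker Player` columns. For comparison, a one-way ANOVA is more commonly fitted on long-format data, with income as the response and career as the grouping factor. The sketch below shows that layout on toy data only: the group means, standard deviation and sample size are invented for illustration and are not the simulated incomes from this article, so the output says nothing about the three careers.

```r
set.seed(123)  # fixed seed so the toy example is repeatable

# Toy long-format data: one row per simulated life, one column for the group.
# The means, standard deviation and sample size are invented for illustration.
n <- 200
toy <- data.frame(
  Career = factor(rep(c("Doctor", "IT Worker", "Poker Player"), each = n)),
  Income = c(rnorm(n, 500, 100), rnorm(n, 600, 100), rnorm(n, 100, 100))
)

# One-way ANOVA: does mean income differ between the careers?
toy_fit <- aov(Income ~ Career, data = toy)
print(summary(toy_fit))

# Pairwise comparisons with Tukey's honest significant differences
toy_pairs <- TukeyHSD(toy_fit)
print(toy_pairs)
```

With the real data, the same formula could be applied to the long `df` table built earlier. `TukeyHSD()` then reports each pair of careers separately, which is one way to see which specific differences drive the overall result.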

Conclusion

As I hope I have shown here, decisions in the real world are not as easy to make as they are in a classroom, where things tend to be clear-cut. This can get very frustrating, and that is not always a bad thing. Frustration is a good motivator to help one persist in one's pursuit of understanding and resolving a problem. With patience and perseverance most things can be overcome.

What I hope I have shown here is how you can make decisions with imperfect data by focusing on a few key variables which you can control. Once you have these, it is easy to simulate random outcomes and then analyse the resulting data. As mentioned earlier in this article, the preferred approach is to simulate multiple scenarios to account for blind luck. You can win the lottery and be better off than any doctor, IT worker or poker player, but you cannot count on that.

Let me know what you think of this article and what would be your approach.