DOMINIC SIMPSON

UN Population Dataset Project

World population image

View project PDF presentation
GitHub code repository for project

Overview

Overview section - image of atlas spinning

This project is the culmination of my 10-week Data Analytics Bootcamp at Cambridge Spark, which took place from May-July 2024. During that time, I learnt the essential skills required for a career in data, including:

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Bokeh
  • SQL
  • Maths and Statistics applied to data

I hope that this project showcases my knowledge and command of these skills when applied to a publicly available dataset.

The dataset that I have used was created by the United Nations Department of Economic and Social Affairs Population Division and is part of the 2024 Revision of World Population Prospects, the twenty-eighth edition of official UN population estimates and projections.

The latest version of the dataset was published on 25th October 2023.  It can be found here on the UN’s Data portal.

The dataset contains mid-year global population estimates and projections (in millions) for the years 2010, 2015, 2021, and 2022. The data was drawn from analyses of historical demographic trends covering some 1,910 national population censuses conducted between 1950 and 2023. The dataset also draws on information from vital registration systems and from 3,189 nationally representative sample surveys, as well as some supplementary calculations by the UN Statistics Division.

Other datasets in the series include population projections to the year 2100 based on a range of plausible outcomes.

The dataset covers population statistics both globally and by area of the world (237 countries or areas in total), broken down by the following variables:

  • Population mid-year estimates (millions)
  • Population mid-year estimates for males (millions)
  • Population mid-year estimates for females (millions)
  • Sex Ratio (males per 100 females)
  • Population aged 0 to 14 years old (percentage)
  • Population aged 60+ years old (percentage)
  • Population density

The dataset also contains statistics on surface areas (thousand km2), though this is not something that is explored in this project.

The dataset is available in CSV and PDF format, accompanied by a location-list Excel document that sets out which countries are located in which region, along with terminology, codes, etc. The raw dataset has 7,654 entries (rows of data), and contains the following column headings:

  • ‘Region/Country/Area’, which comprises two columns: a numerical (integer) identifying code for each region, country, or area in the dataset, corresponding to its index in the Excel spreadsheet; and a description of that region, country, or area (e.g., the first entry is ‘Total, all countries or areas’)
  • ‘Year’ (an integer – either 2010, 2015, 2021, or 2022)
  • ‘Series’: contains variables mentioned previously, such as ‘Population mid-year estimates (millions)’, ‘Population mid-year estimates for males (millions)’, etc.
  • ‘Value’: population or surface area value
  • ‘Footnotes’: accompanying information on each entry (e.g. ‘Projected estimate (medium fertility variant)’; for ‘North America’: ‘Including Bermuda, Greenland, and Saint Pierre and Miquelon’)
  • ‘Source’: A typical example will be ‘United Nations Population Division, New York, World Population Prospects: The 2022 Revision, last accessed July 2022’

Aims & Objectives

I hope that my analysis of this dataset will help uncover and identify demographic trends. In particular, I am interested to know the following, and my project hopes to visualize these factors accordingly:

  • World population over the four given years, including overall percentage increase from 2010-2022
  • Population by continents over the four years and how they compare
  • Population growth by continents in percentages and how they compare
  • Country population statistics and how they compare
  • Countries with the fastest-growing populations over the four years (in percentages)
  • Countries with the lowest-growing populations over the four years (in percentages)

It is my theory that while the global population is inexorably rising, and while some continents may be growing faster than others, this does not mean that there is a uniform population growth in each country in that continent. I want to find out which countries in the continents that do register the fastest growth in population are particularly leading the way, compared to others in that respective continent. Likewise, I want to find out which continents have stagnant growth, and which countries in those continents have particularly slow population growth. In so doing, I hope to find the outliers in the data, and tell stories from what the data reveals.

This project is aimed at anyone who is interested in population statistics and demographic trends, and in being able to see how these factors can be visualised in various ways. I studied Sociology & Social Anthropology at University and have always been fascinated with such statistics.

Lastly, it should be pointed out that my findings do not impart any value judgement or personal opinion on the data. I am simply interested in the pure statistics in and of themselves.

Terminology & Notes

  • Sovereign Nation-States

I have used ‘sovereign nation-state’ regularly in the code in the section on countries. This is to differentiate sovereign nation-states recognised by the UN as such from other areas in the data, such as:

  • Global data
  • Continents
  • Sub-continents (e.g., Sub-Saharan Africa)
  • Overseas territories
  • Cultural groupings (e.g., ‘Latin America’)
  • UN development groups such as:
    • Land-locked Developing Countries (LLDC)
    • Small Island Developing States (SIDS)

Moreover, the term ‘country’ can often be ambiguous. England and Scotland are each countries, but not sovereign nation-states; the sovereign nation-state in question is the United Kingdom. By using the term ‘sovereign nation-state’, I hope to reduce ambiguity, though I do use the terms ‘sovereign nation-state’ and ‘country’ somewhat interchangeably in my project.

  • Disputed Territories

The data recognises several disputed territories around the world while not including others, with the omissions likely due to specific political tensions and sensitivities.

Western Sahara, the State of Palestine, and the Falkland Islands (Malvinas) are all included, for example, while the footnotes for Cyprus mention that the data ‘refers to the whole country’. By contrast, Kosovo is not listed (its only mention is in the Footnotes for Serbia, where it states ‘excluding Kosovo’). Likewise, Taiwan does not have an entry but is mentioned in the Footnotes for China: “For statistical purposes, the data for China do not include those for the Hong Kong Special Administrative Region (Hong Kong SAR), Macao Special Administrative Region (Macao SAR), and Taiwan Province of China”.

  • Continents
  1. Africa: As normal
  2. Antarctica: No permanent population (except research scientists)
  3. Asia: Includes Turkey and the Caucasus (Georgia, Azerbaijan, and Armenia), but not Russia
  4. Europe: Includes Russia but not Turkey or the Caucasus
  5. North America: Includes not just the United States and Canada, but also Greenland, all of the Caribbean, and all of Central America (a sub-region that the dataset lists as including Mexico, even though Mexico is geographically part of North America)
  6. Oceania: Includes Australia, which is often regarded as a continent in its own right, together with the wider region of New Zealand, Melanesia, Polynesia, and Micronesia
  7. South America: As normal

Data Cleaning & Pre-Processing

In order to manipulate the data and display it visually, my first step was to clean and pre-process the data, the first few rows of which resembled the below before any manipulation.

Image of raw data without cleaning

My second step was to delete the index row beginning with ‘T02’, followed by renaming the columns, so that ‘Region/Country/Area’ was now ‘LocationID’ and ‘Unnamed: 1’ was now ‘Region/Country/Area’.  This made for easier readability:

Data after renaming columns
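As a minimal sketch of this step (not the project's actual code), the snippet below assumes the raw CSV begins with a title row starting ‘T02’, followed by the real header row whose second column name is blank. The sample text, the placeholder source string, and the variable names are illustrative assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for the raw file: a 'T02' title row, then the real header row.
# The 'Source' text is a placeholder, not the real citation.
RAW_CSV = '''T02,"Population, surface area and density",,,,,
Region/Country/Area,,Year,Series,Value,Footnotes,Source
1,"Total, all countries or areas",2010,Population mid-year estimates (millions),"6,985.60",,Illustrative source
1,"Total, all countries or areas",2015,Population mid-year estimates (millions),"7,426.60",,Illustrative source
'''

# header=1 takes the second line as the column names, which drops the 'T02' row.
# pandas names the blank header cell 'Unnamed: 1'.
df = pd.read_csv(io.StringIO(RAW_CSV), header=1)

# Rename both columns in one call so the two names do not collide.
df = df.rename(
    columns={
        "Region/Country/Area": "LocationID",
        "Unnamed: 1": "Region/Country/Area",
    }
)

print(df.columns.tolist())
```

Passing both renames in a single dictionary avoids a moment where two columns share the name ‘Region/Country/Area’.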

The next move was to ensure that there were no empty cells in the data, which would cause problems when attempting to visualise it. The ‘Footnotes’ column had a number of cells with no information in them, which Pandas reads as ‘NaN’ (not a number), as can be seen in the screenshot above. Therefore, I replaced every empty cell in the ‘Footnotes’ column with ‘None’:

Image of data after ensuring no empty entries exist
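One way to do this in Pandas is with fillna. The sketch below is an assumption about the approach rather than the project's code, and it uses a two-row stand-in DataFrame.

```python
import numpy as np
import pandas as pd

# Two-row stand-in: one empty footnote (NaN) and one real footnote.
df = pd.DataFrame(
    {
        "Region/Country/Area": ["Total, all countries or areas", "Serbia"],
        "Footnotes": [np.nan, "excluding Kosovo"],
    }
)

# Replace missing footnotes with the literal string 'None'.
df["Footnotes"] = df["Footnotes"].fillna("None")

# Confirm that no missing values remain anywhere in the DataFrame.
print(df.isna().sum())
```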

The final step was to convert the ‘Value’ column, which contains the population (and surface area) statistics, from a string data type to a numeric float data type. Once again, this was necessary in order for the data to be visualised. There was one obstacle to overcome first, however: I had to delete the thousands-separator commas in the column values, because Pandas reads a number containing a comma as a string rather than as a numeric value. Only then was I able to convert the values from a string to a float data type.

The final data types before analysis, and confirmation of no ‘NaN’ values left throughout the dataset, can be viewed below.

Final list of data types

Image of confirmation of no empty entries in data set
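The comma removal and type conversion can be sketched as follows. This is a minimal illustration using three values quoted elsewhere on this page, not the project's actual code.

```python
import pandas as pd

# Values arrive as strings because of the thousands-separator commas.
df = pd.DataFrame({"Value": ["6,985.60", "7,426.60", "743.56"]})

# Strip the commas as plain text (regex=False), then cast to float.
df["Value"] = df["Value"].str.replace(",", "", regex=False).astype(float)

print(df.dtypes)
```

An alternative, if the file is read with pd.read_csv, is to pass thousands="," so that the column is parsed as numeric from the start.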

 

Main Section: Data Analysis

Main Section - image of world map

World

Image of world

My first data visualization shows the global population for the given years 2010, 2015, 2021, and 2022.

The exact figures (in millions) are:

  • 2010: 6,985.60
  • 2015: 7,426.60
  • 2021: 7,909.30
  • 2022: 7,975.11

I immediately came across my first big issue: how to display such large numbers. To avoid confusion, I converted the values from millions to billions on the plot. For example, the figure for 2010 is approximately 6.99 billion. I also ordered the years chronologically on the y-axis, starting with 2010, while the population numbers were on the x-axis. This conversion and ordering make the data easier to read and understand at a glance, even if the user has not been provided with exact figures.

Additionally, rather than sticking with Matplotlib’s default first colour of dark blue, I went for a lighter blue to avoid confusion with the colouring of continents in the next section. It also signifies the blueness of our planet.

Image of data visualization plot showing world population mid-year estimates
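A plot along these lines can be sketched with Matplotlib as below. The colour name, figure size, labels, and the choice to place 2010 at the top are assumptions for illustration. The 2021 value used here, 7,909.30, is the total implied by the continent figures (4,694.58 + 3,214.71).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# World mid-year estimates in millions.
world_millions = {2010: 6985.60, 2015: 7426.60, 2021: 7909.30, 2022: 7975.11}

years = [str(year) for year in world_millions]
billions = [value / 1000 for value in world_millions.values()]  # millions -> billions

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(years, billions, color="lightskyblue")  # lighter blue than the default
ax.invert_yaxis()  # barh draws the first category at the bottom; flip so 2010 is on top
ax.set_xlabel("Population (billions)")
ax.set_ylabel("Year")
ax.set_title("World population mid-year estimates")
fig.tight_layout()
```

Calling plt.show() or fig.savefig with a file name of your choice would then display or save the figure.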

Findings:

  • The world population grew from 6,985.60 million in 2010 to 7,975.11 million in 2022.
  • Increase in population from 2010-2022 = Population in 2022 – Population in 2010: 7,975.11m – 6,985.60m.
  • Increase in population from 2010-2022 = 989.51m (i.e., just under a billion)

Secondly, in order to find out the percentage increase in global population when taking into account all of the four given years, I retrieved the global population figures for the four years, then placed all four values in an array, from which I calculated the mean.

I then calculated the percentage increase between each pair of consecutive years in the data (2010-2015, 2015-2021, and 2021-2022), and took the mean of these three figures to produce the average overall percentage increase, rounding to two decimal places along the way.

The resulting plot shows the percentage increases from 2010-2015, 2015-2021, and 2021-2022, along with stating the aforementioned overall percentage increase over the four years.

NB. I made this plot bigger to accommodate the larger title and the reduced number of bars.

Image of plot showing overall average percentage increase in world population for 2010, 2015, 2021, and 2022
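The calculation described above can be sketched with NumPy as follows. This is a minimal illustration rather than the project's code; the variable names are assumptions, and the 2021 value is the total implied by the continent figures (4,694.58 + 3,214.71).

```python
import numpy as np

years = [2010, 2015, 2021, 2022]
population = np.array([6985.60, 7426.60, 7909.30, 7975.11])  # millions

# Mean of the four mid-year estimates.
mean_population = population.mean()

# Percentage change between each pair of consecutive data points.
pct_increase = (population[1:] - population[:-1]) / population[:-1] * 100

# Average of the three interval increases, rounded to two decimal places.
average_pct_increase = round(float(pct_increase.mean()), 2)

for start, end, pct in zip(years[:-1], years[1:], pct_increase):
    print(f"{start}-{end}: {pct:.2f}%")
print(f"Average overall increase: {average_pct_increase}%")
```

Note that the three intervals are of different lengths (five years, six years, and one year), so the average is a mean of interval increases rather than an annual growth rate.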

 

Findings:

  • The mean global population for the given years was 7,574.1525 million.
  • The average overall percentage increase in global population for the given years was 4.55%.
  • The global population grew by over 6% from 2010-2015, and by slightly more than this from 2015-2021.
  • Although the figure for 2021-2022 is not directly comparable, because that interval spans a single year rather than five or six, the slowdown to less than 1% is still striking. It is possible that external events, such as the effects of the Coronavirus outbreak, contributed to this slowdown.

Continents

Image of world continents

For continents, my first step was to retrieve the overall population values for each continent for the four given years.

For North America, this turned out to be a challenge, as there was no overall entry simply called ‘North America’ in the data; instead, there was ‘Northern America’ (Canada and the USA only), with the Caribbean and Central America (the latter including Mexico) classified separately. I had to combine all these regions and sum the populations, then add a section in my modified dataset in the ‘Region/Country/Area’ column called ‘North America’ – one combined region. I then had to rearrange the section so that it matched the original DataFrame format, so that way the data could be listed the same as the other continents, which were more straightforward to handle.
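The combining step might look something like the sketch below, which sums the three sub-regions per year and appends the result as a new ‘North America’ entry. The sub-region values are obvious placeholders (not real data), and the approach shown is an assumption rather than the project's code. Only the two Europe figures are taken from the dataset.

```python
import pandas as pd

# Placeholder values for the three sub-regions; Europe is included to show
# that other rows are left untouched.
df = pd.DataFrame(
    {
        "Region/Country/Area": [
            "Northern America", "Caribbean", "Central America", "Europe",
            "Northern America", "Caribbean", "Central America", "Europe",
        ],
        "Year": [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
        "Value": [10.0, 20.0, 30.0, 745.17, 11.0, 21.0, 31.0, 743.56],
    }
)

NORTH_AMERICA_PARTS = ["Northern America", "Caribbean", "Central America"]

# Sum the sub-regions for each year.
parts = df[df["Region/Country/Area"].isin(NORTH_AMERICA_PARTS)]
north_america = parts.groupby("Year", as_index=False)["Value"].sum()

# Add the region label as the first column so the layout matches the original.
north_america.insert(0, "Region/Country/Area", "North America")

combined = pd.concat([df, north_america], ignore_index=True)
print(combined[combined["Region/Country/Area"] == "North America"])
```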

Once the data on each continent was combined into a single DataFrame, I then repeated the step from the first World plot and converted the values from millions to billions, in order to be read more easily, along with adding a legend for each continent.

Matplotlib automatically assigned the first six of its ten default colours to the continents, which you can see in the plot.

Image of plot showing population mid-year estimates (billions) by continent and year

Findings:

Asia vastly outstripped all other continents in terms of population during the given period and continues to do so. Asia’s population during the given period was more than the other continents combined:

  • 2010:
    • Asia: 4,221.17
    • Rest of World (RoW):  2,764.42
  • 2015:
    • Asia: 4,459.44
    • RoW: 2,967.15
  • 2021:
    • Asia: 4,694.58
    • RoW: 3,214.71
  • 2022:
    • Asia: 4,722.63
    • RoW: 3,252.48
  • All continents saw a general rise in population over the four years.
  • However, Europe’s population actually fell from 745.17 in 2021 to 743.56 in 2022 – the only continent to see a reduction in population – even though Europe overall gained in population from 2010 (736.28) to 2022 (743.56).
  • While the plot provided a good enough visualization of the data, it would have helped to show the actual figures. Without them, it is difficult for the viewer to infer that the population of Europe actually dropped from 2021 to 2022. Going forward, the solution is to show the exact figures next to each bar in the plot.
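As a quick check on the Asia comparison, the figures listed above can be compared directly. The sketch below is illustrative only; the variable names are assumptions.

```python
# Mid-year estimates in millions, as listed in the findings above.
asia = {2010: 4221.17, 2015: 4459.44, 2021: 4694.58, 2022: 4722.63}
rest_of_world = {2010: 2764.42, 2015: 2967.15, 2021: 3214.71, 2022: 3252.48}

asia_share = {}
for year in asia:
    world_total = asia[year] + rest_of_world[year]
    # Asia's share of the world total, as a percentage.
    asia_share[year] = round(asia[year] / world_total * 100, 1)
    print(f"{year}: world total {world_total:,.2f}m, Asia share {asia_share[year]}%")
```

A share above 50% in every year confirms that Asia's population exceeded that of all the other continents combined.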

For the next continent data plot, I repeated the step from the previous section on overall world population, in order to find out the percentage increase in population by continent when taking into account all of the four given years. I retrieved the population figures for each continent for the four years, then placed all four values in an array, from which I calculated the mean.

As before, I then calculated the percentage increase between each pair of consecutive years in the data, and then produced the average overall percentage increase over the four years from this, rounding to two decimal places along the way. Once again, the resulting plot shows the percentage increases, only this time by continent, along with the aforementioned overall percentage increase over the four years.

The difference this time is that in the resulting Seaborn plot, I have included the exact percentages above each continent.

Additionally, I have kept the same colour scheme for each continent for consistency.

Image of plot showing average overall percentage increase in population by continent (2010, 2015, 2021, and 2022)

Findings:

  • Africa has seen the fastest growth in population at 10.74%, followed by Oceania, Asia, South America, and North America.
  • By contrast, Europe sits last, at only 0.33% – a very small increase compared to all other continents.

Countries

Image of map of countries

For my focus on statistics by country, my first step was to deal with the entries in the data that are not sovereign nation-states, which I treated as outliers. The kinds of non-sovereign territories featured in the data are covered in more detail in the ‘Terminology & Notes: Sovereign Nation-States’ section.

I found that the way to filter these entries was to add a column called ‘Sovereign nation-state’ and initialize it. I then created a full list of every sovereign nation-state in the world, matching the exact title of each sovereign nation-state in the ‘Region/Country/Area’ column (for example, Tanzania’s official title is ‘United Rep. of Tanzania’ and the Vatican City’s is ‘Holy See’).

Once that list was complete, I then applied a lambda Boolean condition to automatically update the newly-created ‘Sovereign nation-state’ column with a ‘Yes’ or ‘No’ confirmation. The data now had a confirmation for each entry on whether it is a sovereign nation-state or not.

Image of data with new 'sovereign nation-state' column
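A minimal sketch of this flagging step is shown below. The two-item list stands in for the full list of sovereign nation-states, and the constant name is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region/Country/Area": [
            "Africa",
            "United Rep. of Tanzania",
            "Holy See",
            "Falkland Islands (Malvinas)",
        ]
    }
)

# Truncated stand-in for the full list; names must match the dataset exactly.
SOVEREIGN_STATES = {"United Rep. of Tanzania", "Holy See"}

# Flag each row 'Yes' or 'No' depending on membership of the list.
df["Sovereign nation-state"] = df["Region/Country/Area"].apply(
    lambda name: "Yes" if name in SOVEREIGN_STATES else "No"
)

# Keep only the sovereign nation-states for the country-level analysis.
countries = df[df["Sovereign nation-state"] == "Yes"]
print(countries)
```

Using a set rather than a list keeps the membership test fast, though with fewer than 200 names either would work.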

With this handling of outliers complete, I was then able to construct an interactive Bokeh plot, showing population sizes for each sovereign nation-state, with each circle representing a sovereign nation-state. I ensured that the circles were scaled accordingly so that they were bigger proportionally depending on the population and used a yellow colour to distinguish sovereign nation-states from the colours used in the previous sections. The countries are ordered alphabetically across the screen.

When the user mouses over each circle, it produces a tooltip that tells the user the name of the sovereign nation-state and the corresponding population. As it was impractical to construct four plots for each of the four years given in the data, I simply used the data for the year 2022 in the plot.

A challenge was getting the smaller countries to be visible on the plot; this was overcome by adding a line plot to connect each circle (the blue lines that you can see in the plot).

If you slide to the right, you will see that there are a number of controls available. One of these controls, Box Zoom, allows you to zoom in on each of the circles, for more focus.
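The sketch below shows one way such a plot could be put together in Bokeh, with hover tooltips, a connecting line, and circle sizes tied to population. It is an assumption-laden illustration, not the project's code: only China's figure comes from the data, the two ‘Example State’ entries are placeholders, and the square-root size scaling is my own choice.

```python
import math

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# 2022 populations in millions. Only China's value is from the dataset;
# the 'Example State' entries are placeholders.
population_2022 = {
    "China": 1425.89,
    "Example State A": 50.0,
    "Example State B": 5.0,
}

names = sorted(population_2022)  # alphabetical order along the x-axis

source = ColumnDataSource(
    data={
        "country": names,
        "population": [population_2022[name] for name in names],
        # Square-root scaling keeps circle area roughly proportional to
        # population, with a minimum size so small states stay visible.
        "size": [max(4.0, math.sqrt(population_2022[name])) for name in names],
    }
)

p = figure(
    x_range=names,
    title="Population by sovereign nation-state, 2022 (millions)",
    tools="hover,box_zoom,reset",
    tooltips=[("Country", "@country"), ("Population (millions)", "@population{0,0.00}")],
)

# Connecting line to help locate the smaller circles, then the circles themselves.
p.line(x="country", y="population", source=source, color="blue")
p.scatter(x="country", y="population", size="size", source=source,
          color="yellow", line_color="black")
```

Calling bokeh.plotting.show(p) would open the interactive plot in a browser.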

Findings:

  • China and India vastly dominated all other sovereign nation-states in population in 2022. The USA, in third place, had a population less than a quarter of that of either China or India. The two countries remain outliers.
  • China, the sovereign nation-state with the largest population in the world in 2022 (since overtaken by India) at 1,425.89 million, had only slightly fewer people than the entire continent of Africa, the world’s second-biggest continent by population, at 1,426.74 million that year.
  • The majority of sovereign nation-states in 2022 had populations from very small to 100m. This continues to be the case.

For the next country data plot, I then repeated once again the steps from the previous sections, in order to find out the percentage increase in population by sovereign nation-states when taking into account all of the four given years.

However, this immediately presented a problem: it would’ve been impractical to do this manually for every sovereign nation-state as I did with the continents, and it would’ve meant a vast amount of code (and would also have violated the Don’t Repeat Yourself (DRY) principle of coding).

Instead, I solved this conundrum by:

  • Pivoting the DataFrame so that ‘Region/Country/Area’ was the index while keeping the ‘Year’ and ‘Value’ as columns
  • Calculating percentage population increases over the specified years as usual
  • Calculating the average of the percentage population increases and storing the results in a new column in the DataFrame, ‘Average Country Population Percentage Increase’, by taking the mean across columns for each row (i.e. row-wise) via the axis=1 parameter
  • Rounding the values in ‘Average Country Population Percentage Increase’ as usual to two decimal places
  • Resetting the index to turn ‘Region/Country/Area’ back into a column, so that the DataFrame could be used for data visualisation

In addition, I have included the mean population of each sovereign nation-state over the four years, to give an indication of its typical population size across the period.

Image of data with country mean population and average country population percentage increase columns
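The pivot-based steps listed above can be sketched as follows. ‘Country A’ and ‘Country B’ and their values are placeholders rather than real data, the ‘Country Mean Population’ column name is an assumption, and the code is a minimal illustration of the technique, not the project's implementation.

```python
import pandas as pd

# Long-format stand-in: one row per country per year (placeholder values).
long_df = pd.DataFrame(
    {
        "Region/Country/Area": ["Country A"] * 4 + ["Country B"] * 4,
        "Year": [2010, 2015, 2021, 2022] * 2,
        "Value": [10.0, 11.0, 12.1, 12.1, 50.0, 49.0, 49.0, 48.02],
    }
)

# One row per country, one column per year.
wide = long_df.pivot(index="Region/Country/Area", columns="Year", values="Value")
wide.columns.name = None  # drop the leftover 'Year' label on the column index

years = [2010, 2015, 2021, 2022]

# Percentage increase for each consecutive pair of years, for all rows at once.
increases = pd.DataFrame(
    {
        f"{start}-{end}": (wide[end] - wide[start]) / wide[start] * 100
        for start, end in zip(years[:-1], years[1:])
    }
)

# Row-wise means (axis=1), computed before the new columns are added.
wide["Country Mean Population"] = wide[years].mean(axis=1)
wide["Average Country Population Percentage Increase"] = increases.mean(axis=1).round(2)

# Turn the country name back into an ordinary column for plotting.
result = wide.reset_index()
print(result)
```

Because every operation is vectorised across rows, the same few lines handle two countries or nearly 200 without any repetition.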

With this new data, I was able to construct my second interactive Bokeh plot, this time a slope that shows the average overall percentage growth for each country over the given years. Rather than displaying the countries alphabetically, as in the previous plot, I sorted them from lowest population growth to highest, so that the circles followed a central slope in a trend that could be easily understood, and I employed the same circle colour. I also abandoned scaling each circle in proportion to population size, as in the previous plot, and instead made every circle the same size, for ease of viewing.

A big challenge was in managing to include all the world’s sovereign nation-states in the plot without the circles overlapping each other. Constructing this plot made me understand how difficult it can be to visualize variables effectively, when those variables can number nearly 200. In order to successfully display so many variables, I reduced the sizes of each circle and employed Bokeh’s Dodge transform function to ensure that the circles did not overlap.

As with the previous plot, if the user hovers over each circle, it shows a tooltip displaying the name of the sovereign nation-state; this time, however, it displays the average overall population percentage increase of that sovereign nation-state over the four given years, rather than overall population.

As before, if you slide to the right, you will see that there are a number of controls available. One of these controls, Box Zoom, allows you to zoom in on each of the circles, for more focus.


Findings:

  • Jordan recorded the highest population percentage increase of any sovereign nation-state over the given four years, with an overall average of 18.56%.
  • At the other end of the spectrum, the Marshall Islands had the lowest population percentage increase of any sovereign nation-state, with an overall average of -6.67% – despite Oceania recording the second-highest population growth of world continents by percentage.
  • Countries in the Middle East and Africa dominate the list of nation-states with the highest population percentage increases.
  • Although Oceania and the Caribbean feature in the list of nation-states with the lowest population percentage increases, the geographical region featured most is Eastern Europe.
  • Outside of Eastern Europe, Italy and Portugal also recorded low growth.
  • The traditional East Asian powerhouses of South Korea, China, and Japan also recorded low percentages, at 2.03%, 1.90%, and -1.09% respectively. This illustrates that the population of Asia is not growing uniformly across all sovereign nation-states in the continent.

The plot can be difficult on the eye, which is why for my final two visualisations – returning to non-interactive Matplotlib plots – I have shown, in bar chart form, the top 10 sovereign nation-states with the highest population percentage increases in the world, and conversely the bottom 10 sovereign nation-states with the lowest. The two plots can be seen below.

I have kept to the same colour scheme and sorted the lists of sovereign nation-states accordingly.

The findings for these two plots are the same as for the previous plot.

Image of plot showing top ten countries with highest average overall population percentage increases from the years 2010, 2015, 2021, and 2022

 

Image of plot showing bottom ten countries with lowest average overall population percentage increases from the years 2010, 2015, 2021, and 2022
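Selecting the highest and lowest entries for charts like these can be sketched with Pandas as below. The five rows use the percentages quoted in the findings above, with country names as written on this page rather than as they appear in the dataset. With so few rows, n is set to 2 here; the full data would use n=10.

```python
import pandas as pd

COLUMN = "Average Country Population Percentage Increase"

# Five rows taken from the findings above.
growth = pd.DataFrame(
    {
        "Region/Country/Area": ["Jordan", "Marshall Islands", "South Korea", "China", "Japan"],
        COLUMN: [18.56, -6.67, 2.03, 1.90, -1.09],
    }
)

# nlargest / nsmallest return the rows already sorted for plotting.
top = growth.nlargest(2, COLUMN)
bottom = growth.nsmallest(2, COLUMN)

print(top)
print(bottom)
```

Each result could then be passed to a Matplotlib bar chart, for example with ax.barh on the country names and percentages.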

Conclusion

Image of world map made up of postage notes

It is my hope that with this project, I have been able to take a dataset of raw data and manipulate it to uncover not just general population statistics, but also key demographic trends over the selected period. In doing so, I hope that this project can shine a light on key facts about the population of our planet between 2010 and 2022, during which time the global population increased by nearly a billion people.

I would like to eventually return to this project, in particular to compare it against the aforementioned separate UN datasets of population projections up to the year 2100. A data visualisation comparison of the two could be extremely interesting.

In the meantime, I hope that you have enjoyed reading this project just as much as I enjoyed creating it.

 

Credits and info

Project undertaken July – August 2024, coded using GitHub Codespaces and Visual Studio Code

POP/DB/WPP/Rev.2024/F0-1

Data copyright © July 2024 by United Nations, made available under a Creative Commons license CC BY 3.0 IGO: http://creativecommons.org/licenses/by/3.0/igo

United Nations, Department of Economic and Social Affairs, Population Division (2024). World Population Prospects 2024, Online Edition.

Images provided by PowerPoint Stock Images, Creative Commons License / Wikimedia Commons, and NASA.

Photo of Earth from space

Thank you for reading.