Introduction to EDA

Exploratory data analysis (EDA) is an approach to analysing data sets in order to summarise their main characteristics, often by examining trends and using graphics and other data visualisation methods.

We try to get an overall feel for the data by looking at how it is structured and how its values are distributed.

Our Data

The data set we are using is the American baby names data set, which contains counts of baby names in the United States from 1880 to 2020.

Let's get right into it

Install dependencies

pip install pandas
pip install matplotlib
pip install streamlit

We shall be using pandas and matplotlib for the core analysis and visualisation, and streamlit for creating the web app for our project.

Start coding

Import the modules

import pandas as pd
import matplotlib.pyplot as plt

Initial EDA

df = pd.read_csv("https://github.com/Rishav-12/baby_name_trends/raw/main/names_ranks_counts.csv")
print(df.head())

We load the CSV into a data-frame and look at the first 5 rows. Each row has a baby's name, gender, the year of birth, and a count of how many babies were given that name in that particular year.
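Beyond head(), a few quick checks help confirm how the data is structured before we filter it. The sketch below runs on a tiny hand-made data-frame (the rows are made up for illustration and are not taken from the real file) and assumes the column names name, sex, year and count that the rest of this article uses.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from read_csv above.
sample = pd.DataFrame(
    {
        "name": ["John", "Mary", "John", "Mary"],
        "sex": ["M", "F", "M", "F"],
        "year": [1881, 1880, 1880, 1881],
        "count": [10, 20, 12, 18],
    }
)

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # which columns hold text and which hold numbers
print(sample.isna().sum())  # missing values per column
print(sample["year"].min(), sample["year"].max())  # range of years covered
```

The same four calls should work unchanged on the full df, provided the columns are named as assumed here.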

To analyse the year-wise trend of a given name, we should filter the data-frame by both name and gender, since some names are given to both boys and girls.

To do this we write the following code:

baby_name = "John"
gender = "M"
name_df = df[(df["name"] == baby_name) & (df["sex"] == gender)]
print(name_df)

Here we see each row has the name "John" and gender "M". We shall later use user input for the name and gender.
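Since the name and gender will later come from user input, it can help to wrap the filter in a small function. This is only a sketch: filter_name is a hypothetical helper name, and the sample rows are made up for illustration.

```python
import pandas as pd


def filter_name(df: pd.DataFrame, baby_name: str, gender: str) -> pd.DataFrame:
    """Return the rows of df that match the given name and gender."""
    # Bracket access avoids clashes between column names and DataFrame attributes.
    mask = (df["name"] == baby_name) & (df["sex"] == gender)
    return df[mask]


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "name": ["John", "John", "Mary", "John"],
        "sex": ["M", "F", "F", "M"],
        "year": [1881, 1880, 1880, 1880],
        "count": [10, 1, 20, 12],
    }
)

johns = filter_name(sample, "John", "M")
print(johns)
```

A name that does not occur in the data simply yields an empty data-frame, which the app can check for before plotting.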

Visualization/Graphing

Now, if we want to know how the frequency of the name changed over the years, we can plot the count against the year:

plt.plot(name_df["year"], name_df["count"])
plt.xlabel("Years")
plt.ylabel("No. of babies")
plt.show()

The graph does not look quite as we expected. Inspecting the filtered data-frame, we notice that the rows are not sorted by year, so the line is drawn between points in the wrong order. We can easily fix this by sorting on the year column:

name_df = name_df.sort_values("year")

If we plot again, we get a proper graph.

We can change the name and gender to visualise different names and how common they were over the years.
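Putting the sort and the plot together might look like the sketch below. The rows are made up and deliberately out of order, the Agg backend is chosen only so the example runs without a display, and name_trend.png is a hypothetical output file name.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, deliberately unsorted rows for illustration only.
name_df = pd.DataFrame({"year": [1882, 1880, 1883, 1881], "count": [14, 10, 15, 12]})

# plt.plot joins the points in row order, so sort by year first.
name_df = name_df.sort_values("year")

plt.plot(name_df["year"], name_df["count"])
plt.xlabel("Years")
plt.ylabel("No. of babies")
plt.savefig("name_trend.png")  # hypothetical file name; use plt.show() when working interactively
```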

Building the UI

Finally, we can use Streamlit, a Python library for building data apps, to create a front-end for our application.

First, we import it at the top of our file:

import streamlit as st

Now, we can use the st.write() function to write a heading for our app or some other information.

Next, we use two input widgets to read the name and gender of the user's choice:

baby_name = st.text_input("Baby name", placeholder="Enter a baby name")
gender = st.radio("Gender of the baby", ("M", "F"))

To embed the plot in the web page, we create a matplotlib figure object fig, draw the plot on it, and then pass it to st.pyplot(fig).
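One way to get such a figure object is to draw on explicit axes rather than through the plt.plot shortcut. The following is a minimal sketch, not necessarily how the original app does it: build_trend_figure is a hypothetical helper name, and the rows are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; the app renders the figure, not a window
import matplotlib.pyplot as plt
import pandas as pd


def build_trend_figure(name_df: pd.DataFrame):
    """Plot count against year on a fresh figure and return that figure."""
    fig, ax = plt.subplots()
    ordered = name_df.sort_values("year")  # keep the line in chronological order
    ax.plot(ordered["year"], ordered["count"])
    ax.set_xlabel("Years")
    ax.set_ylabel("No. of babies")
    return fig


# Made-up rows for illustration only.
sample = pd.DataFrame({"year": [1881, 1880, 1882], "count": [12, 10, 14]})
fig = build_trend_figure(sample)
```

In the Streamlit script, the returned figure would then be handed to st.pyplot(fig), with the filtered name_df passed in place of the sample rows.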

Closing remarks

Our web app can be hosted on Streamlit Cloud; I shall leave that as an exercise for the reader. Today, we performed EDA and visualised a large data set (more than 2 million rows long). I hope this inspires you to create your own EDA project, maybe with a different data set.

You can find the code on GitHub.

There's also a live version of the web app here

Thank you so much for reading this article. Bye for now!

Analysing American Baby Name Trends with Python

A Python EDA Project
