This project aims to provide basic knowledge of data cleaning and various methods for reading and exploring data. In Task 1, several problems in the dataset were identified and resolved. Furthermore, Task 2 involved exploring the cleaned dataset using basic methods.
Task 1 aims to identify problems in the given dataset. Five problems were found: duplicated rows, inconsistent values, spelling mistakes, irregular spacing, and missing values.
data.columns = data.columns.str.strip()
data.drop_duplicates(inplace=True)
data = data.applymap(lambda x: x.replace("%%", "%") if isinstance(x, str) else x)
data.loc[8, 'Time period'] = data.loc[8, 'Time period'].replace('-12', '')
data['Column name'] = data['Column name'].str.replace('misspelled value', 'correctly spelled value', regex=False)
data = data.applymap(lambda x: x.lstrip() if isinstance(x, str) else x)
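As a rough illustration of how these cleaning steps fit together, the sketch below applies them to a small made-up DataFrame. The column names, values, and the `clean_text` helper are hypothetical and are not taken from the project dataset. `Series.map` is applied column by column here because `DataFrame.applymap` is deprecated in recent pandas versions.

```python
import pandas as pd

# Hypothetical toy data with the same kinds of problems (not the project dataset)
toy = pd.DataFrame({
    " Country ": [" Country A", "Country B", "Country B"],
    "Total ": ["45%%", "60%", "60%"],
})

def clean_text(value):
    """Collapse doubled percent signs and drop leading spaces in strings."""
    if isinstance(value, str):
        return value.replace("%%", "%").lstrip()
    return value

toy.columns = toy.columns.str.strip()              # remove spaces around column names
toy = toy.drop_duplicates()                        # drop exact duplicate rows
toy = toy.apply(lambda col: col.map(clean_text))   # clean every cell, column by column

print(toy)
```

Returning a new DataFrame at each step, instead of using `inplace=True`, keeps the order of the operations easy to follow.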
Find missing values
missing_values = data.isnull().sum()
Calculate the median value
rural_median = data['Rural (Residence)'].median()
Replace missing values with the median value
data['Rural (Residence)'] = data['Rural (Residence)'].fillna(rural_median)
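The missing-value step can be sketched on a tiny example. The numbers below are invented for illustration, and the column is assumed to be numeric already. The median is used as the fill value because it is less sensitive to extreme values than the mean.

```python
import pandas as pd

# Hypothetical values; None marks a missing entry
toy = pd.DataFrame({"Rural (Residence)": [10.0, None, 30.0, 50.0]})

missing_counts = toy.isnull().sum()                # number of missing values per column
rural_median = toy["Rural (Residence)"].median()   # missing values are skipped by default

# Assign the filled column back instead of filling in place on a column selection
toy["Rural (Residence)"] = toy["Rural (Residence)"].fillna(rural_median)

print(missing_counts)
print(toy)
```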
A box plot is generated for the quantitative variable. It shows the first quartile (Q1), the median (Q2), and the third quartile (Q3), with whiskers extending toward the minimum and maximum, which helps analyze the spread of the data. This box plot compares four income groups: 'High income group', 'Low income group', 'Lower middle income group', and 'Upper middle income group'.

The key code to generate the box plot is provided below.
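import matplotlib.pyplot as plt
# Strip the trailing '%' and convert to numbers (float tolerates missing values)
data['Total'] = data['Total'].str.rstrip('%').astype(float)
# One box per income group; rows with missing values are excluded
data.dropna().boxplot(column='Total', by='Income Group')
plt.show()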
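To make the quantities behind a box plot concrete, the sketch below computes the quartiles and the interquartile range (IQR) per group with pandas. The group labels and percentages are invented for illustration and are not project results.

```python
import pandas as pd

# Hypothetical percentages stored as strings, as in the raw 'Total' column
toy = pd.DataFrame({
    "Income Group": ["High"] * 5 + ["Low"] * 5,
    "Total": ["10%", "20%", "30%", "40%", "50%", "5%", "10%", "15%", "20%", "25%"],
})
toy["Total"] = toy["Total"].str.rstrip("%").astype(float)

# Q1, median, and Q3 for each group; the box in a box plot spans Q1 to Q3
quartiles = toy.groupby("Income Group")["Total"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["Q1", "Q2 (median)", "Q3"]
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]

print(quartiles)
```

A wider IQR for one group indicates that its values are more spread out around the median.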
This part compares the mean values of the four income groups and the median values of the rural and urban residence columns.

The key code to compare the mean values is provided below.
data.groupby('Income Group')['Total'].mean()
The key code to compare the median values is provided below.
copy_data['Rural (Residence)'].median()
copy_data['Urban (Residence)'].median()
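Where both statistics are wanted per group, they can be computed side by side. The following is a minimal sketch on invented numbers, not the project data. A large gap between mean and median usually points to skewed values or outliers within a group.

```python
import pandas as pd

# Hypothetical values; the 'Low' group contains one unusually large entry
toy = pd.DataFrame({
    "Income Group": ["High", "High", "High", "Low", "Low", "Low"],
    "Total": [90, 92, 100, 10, 20, 90],
})

# Mean and median per group in a single table
summary = toy.groupby("Income Group")["Total"].agg(["mean", "median"])
summary["mean - median"] = summary["mean"] - summary["median"]

print(summary)
```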
Following the median comparison, the top 10 countries are listed. Two separate lists are displayed, one for rural areas and one for urban areas.

The key code to generate the list of the top 10 countries is provided below.
top_rural_countries = copy_data.nlargest(10, 'Rural (Residence)')[['Countries and areas', 'Rural (Residence)']]
top_urban_countries = copy_data.nlargest(10, 'Urban (Residence)')[['Countries and areas', 'Urban (Residence)']]
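The behavior of `nlargest` can be checked on a small example. The sketch below uses invented country labels and values and assumes the columns are already numeric. If they still hold strings such as '45%', they would need to be converted first.

```python
import pandas as pd

# Hypothetical data (not the project dataset)
toy = pd.DataFrame({
    "Countries and areas": ["A", "B", "C", "D"],
    "Rural (Residence)": [12, 45, 30, 8],
    "Urban (Residence)": [60, 55, 80, 20],
})

# Top 2 rows by each column, sorted from largest to smallest
top_rural = toy.nlargest(2, "Rural (Residence)")[["Countries and areas", "Rural (Residence)"]]
top_urban = toy.nlargest(2, "Urban (Residence)")[["Countries and areas", "Urban (Residence)"]]

print(top_rural)
print(top_urban)
```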
In this part, the poorest and richest groups were compared using three statistics: the mean, the median, and the standard deviation. The key code to generate these statistics is provided below.
um_data = copy_data[copy_data['Income Group'] == 'Upper middle income (UM)']
percentage_columns = ['Poorest (Wealth quintile)', 'Richest (Wealth quintile)']
statistics = um_data[percentage_columns].describe()
print("Poorest (Wealth quintile) statistics:")
print(statistics.loc[['mean', '50%', 'std'], 'Poorest (Wealth quintile)'])
print("\nRichest (Wealth quintile) statistics:")
print(statistics.loc[['mean', '50%', 'std'], 'Richest (Wealth quintile)'])
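In the output of `describe()`, the '50%' row is the median, which is why it is selected above. As a hedged alternative, the same three statistics can be requested directly with `agg`. The sketch below uses invented values and compares both approaches.

```python
import pandas as pd

# Hypothetical percentages for the two wealth quintiles
toy = pd.DataFrame({
    "Poorest (Wealth quintile)": [10.0, 20.0, 30.0],
    "Richest (Wealth quintile)": [70.0, 80.0, 99.0],
})

# Approach used above: pick rows from describe(); '50%' is the median
via_describe = toy.describe().loc[["mean", "50%", "std"]]

# Alternative: request the statistics by name
via_agg = toy.agg(["mean", "median", "std"])

print(via_describe)
print(via_agg)
```

Both tables report the sample standard deviation, which is the pandas default.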