Explain the importance of exploratory data analysis for the given dataset.

Data Science Project: Covid-19 Vaccinations (Globally and in Australia) Dataset Analysis with Python and Microsoft Power BI

INSTRUCTIONS

ABOUT

This report is the continuation of a previous assignment. In the previous assignment, descriptive and inferential statistical analyses were performed on a selected dataset.

The objective of this assignment is to perform further data analyses on the previously selected dataset (vaccinations.csv, attached).

The current assignment evaluates your ability to:

• utilise essential data wrangling techniques

• implement basic supervised and unsupervised machine learning models on selected data science and visualisation tools

• construct a viable data science pipeline

TASKS

You are required to apply data wrangling techniques, exploratory data analyses, basic visualisations, and the linear regression model learned in this unit.

This project must culminate in building a predictive model using linear regression and reporting its performance.

The information gathered from the exploratory data analyses should ideally be used to inform and improve the performance of the linear regression model.

Cross validation should be used to evaluate the model’s performance.
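
The cross-validation requirement can be illustrated with a minimal sketch written in NumPy only. The data below are synthetic stand-ins rather than rows from vaccinations.csv, and the helper name kfold_r2 is hypothetical. In practice a library routine such as scikit-learn's cross_val_score would typically be used instead of a hand-rolled loop.

```python
import numpy as np

def kfold_r2(X, y, k=5, seed=0):
    """Return the R^2 score on each of k held-out folds for a least-squares fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle row indices once
    folds = np.array_split(idx, k)         # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and fit on the training rows only.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        # Score on the rows the model has not seen.
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        pred = A_test @ coef
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic stand-in data (assumption: two explanatory variables, one response).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

scores = kfold_r2(X, y, k=5)
print("R^2 per fold:", scores.round(3))
print("mean R^2:", round(float(scores.mean()), 3))
```

Reporting the mean and spread of the per-fold scores gives a more honest picture of performance than a single train/test split.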

Evidence of the ability to perform feature engineering and visualise data using Python’s pyplot (matplotlib.pyplot) and MS Power BI is highly sought after.

Although the specifics of your work are dictated by the objective you set in the previous assignment, you must present your work in a report with the following basic structure:

Title Page
Table of Contents
1. Introduction
2. Objective
3. Method
4. Results and Discussions
5. Project Management
6. Conclusion
7. References

Note: you may add additional sections, where appropriate.

Tasks

• Use Python to structure and clean data, and perform feature engineering on the given dataset

• Identify issues with the given dataset (is there any missing data?)

• Use Python to apply the following data wrangling techniques and feature engineering to the given dataset:

– Listwise or case deletion
– Mean substitution
– Regression imputation
– Feature engineering
– Feature construction
– Feature extraction
– Feature importance and feature selection
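
The three missing-data techniques listed above can be sketched with pandas and NumPy as follows. The toy frame and its column names (day, doses) are placeholders chosen for illustration, not confirmed columns of vaccinations.csv, and the regression imputation assumes a simple straight-line relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in "doses".
df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6],
    "doses": [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
})

# 1) Listwise (case) deletion: drop every row that has a missing value.
listwise = df.dropna()

# 2) Mean substitution: replace missing values with the column mean.
mean_sub = df.assign(doses=df["doses"].fillna(df["doses"].mean()))

# 3) Regression imputation: fit doses ~ day on the complete rows,
#    then predict the missing values from the fitted line.
known = df.dropna()
slope, intercept = np.polyfit(known["day"], known["doses"], deg=1)
missing = df["doses"].isna()
reg_imp = df.copy()
reg_imp.loc[missing, "doses"] = intercept + slope * df.loc[missing, "day"]

print(listwise, mean_sub, reg_imp, sep="\n\n")
```

Listwise deletion shrinks the sample, mean substitution flattens the variance, and regression imputation preserves the trend but understates the noise, so the choice should be justified in the report.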

Apply three commonly used feature engineering techniques such as one-hot encoding, binning or discretisation, and frequency encoding to the given dataset.
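
A small sketch of the three encoding techniques named above, using pandas. The column names (continent, pct_vaccinated) and the bin edges are assumptions made for illustration and would need to be adapted to the actual dataset.

```python
import pandas as pd

# Hypothetical toy data; not taken from vaccinations.csv.
df = pd.DataFrame({
    "continent": ["Oceania", "Asia", "Asia", "Europe", "Asia", "Europe"],
    "pct_vaccinated": [5.0, 35.0, 62.0, 48.0, 91.0, 77.0],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["continent"], prefix="continent")

# Binning (discretisation): turn a continuous value into ordered bands.
# Intervals are (0, 33], (33, 66], (66, 100].
df["coverage_band"] = pd.cut(
    df["pct_vaccinated"],
    bins=[0, 33, 66, 100],
    labels=["low", "medium", "high"],
)

# Frequency encoding: replace each category with its relative frequency.
freq = df["continent"].value_counts(normalize=True)
df["continent_freq"] = df["continent"].map(freq)

print(one_hot)
print(df)
```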

Exploratory Data Analysis

• Explain the importance of exploratory data analysis for the given dataset

• Describe common data exploration methods that can be applied to the given dataset.

• Use Python to perform univariate and bivariate exploratory data analyses, aided by effective data visualisation techniques, on the given dataset.

• Determine the response and the explanatory variables for the given dataset.

• Use Python to plot a scatterplot to depict the direction, form, and strength of the association between the two variables for the given dataset. Interpret the scatterplot.

• Use Python to apply Anscombe’s Quartet to the given dataset and interpret it.

• Interpret the scatterplot and answer the following questions:

a) Do the points form a clear trend with a particular direction, are they more scattered about a general trend, or is there no obvious pattern?

b) If there is a trend, is it generally upward or generally downward as we look from left to right? A general upward trend indicates a positive association while a general downward trend suggests a negative association.

c) If there is a trend, does it seem to follow a straight line (linear association), or some other curve?

d) Are there any outlier points that are clearly distinct from a general pattern in the data?
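
As background for the Anscombe’s Quartet task, the sketch below computes summary statistics for the four classic Anscombe datasets (values as published by Anscombe in 1973). It is an illustration of why scatterplots matter, not a prescribed way of applying the quartet to vaccinations.csv: the four sets share nearly identical means, correlations, and fitted lines, yet look entirely different when plotted.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    summary[name] = {
        "mean_x": x.mean(), "mean_y": y.mean(),
        "r": r, "slope": slope, "intercept": intercept,
    }
    print(name, {k: round(float(v), 3) for k, v in summary[name].items()})
```

Plotting each pair as a scatterplot shows a linear cloud, a curve, a line with one outlier, and a vertical cluster with one leverage point, which is the reason summary statistics alone should not be trusted.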

• Use Python to calculate the correlation coefficient between the two variables for the given dataset.

[The sample correlation r has a number of properties:

The value of r is always between -1 and 1, inclusive.

The sign of r (positive or negative) indicates the direction of association.

Values of r close to +1 or -1 indicate a strong linear relationship, while values of r close to 0 indicate little or no linear relationship.]
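
The properties above can be checked with a short sketch that computes the sample correlation directly from its definition (sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name pearson_r and the toy values are illustrative only.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation computed from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfect upward line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfect downward line
r_flat = pearson_r(x, [3, 1, 4, 1, 3])   # no linear trend

# Cross-check against NumPy's built-in correlation matrix.
r_numpy = float(np.corrcoef(x, [2, 4, 6, 8, 10])[0, 1])

print(r_up, r_down, r_flat, r_numpy)
```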

• What can be observed for the given dataset? Interpret the correlation coefficient.

A number of cautions must be taken when interpreting a correlation coefficient:

a) A strong positive or negative correlation does not (necessarily) imply a cause and effect relationship between the two variables.

b) A correlation near zero does not (necessarily) mean that the two variables are not associated, since the correlation coefficient only measures the strength of a linear relationship.

The two variables may potentially be associated via a non-linear relationship. Plotting a scatterplot could help ascertain the association.

c) The correlation coefficient can be heavily influenced by outliers. Always plot your data (e.g. using a scatterplot) and consider removing any outliers before re-calculating the correlation.
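
Cautions (b) and (c) can be demonstrated on synthetic data, as in the sketch below. The numbers are generated for illustration and do not come from the vaccination dataset.

```python
import numpy as np

# Caution (b): a perfect but non-linear relationship can give r near 0.
x = np.arange(-5, 6, dtype=float)
y = x ** 2                                # y is fully determined by x
r_curve = float(np.corrcoef(x, y)[0, 1])

# Caution (c): a single extreme point can manufacture a strong correlation.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)                   # generated independently of a
r_before = float(np.corrcoef(a, b)[0, 1])

a_out = np.append(a, 25.0)                # add one extreme point at (25, 25)
b_out = np.append(b, 25.0)
r_after = float(np.corrcoef(a_out, b_out)[0, 1])

print("curved relationship r:", round(r_curve, 3))
print("independent data r:", round(r_before, 3))
print("with one outlier r:", round(r_after, 3))
```

In both cases a scatterplot reveals immediately what the single number hides.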

• Detect any outliers for the given dataset.

The simplest method for detecting outliers is to use the Inter-Quartile Range (IQR).

The IQR method defines an outlier in a variable as:

Outlier < Q1 – 1.5 × IQR or Outlier > Q3 + 1.5 × IQR,

where Q1 and Q3 are the 25th and 75th percentiles of all values in the variable, and IQR = Q3 – Q1.
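
The IQR rule above can be sketched in a few lines of NumPy. The helper name iqr_outliers and the sample values are illustrative; note that np.percentile uses linear interpolation by default, so the quartiles may differ slightly from those produced by other software.

```python
import numpy as np

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and the fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

# Hypothetical sample with one obviously extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
outliers, (lower, upper) = iqr_outliers(data)

print("fences:", lower, upper)
print("outliers:", outliers)
```

The same fences are what a standard boxplot draws its whiskers against, so the flagged values should match the points shown beyond the whiskers.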

Univariate analysis: Continuous variable

• Using Python, plot a boxplot for the given dataset and inspect if there are any outliers.

Interpret it.

Univariate analysis: Categorical variable

• Plot a bar chart for the given dataset. Interpret it.

Bivariate analysis: Continuous vs. continuous variables

• Compute the Pearson’s Correlation between Variable 1 and Variable 2 for the given dataset.

Interpret the result.

• Read the given .csv dataset into a pandas DataFrame. Compute the pairwise Pearson’s Correlation between continuous variables in the dataset. Plot the correlation matrix using a heatmap.
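
A minimal sketch of this step, assuming pandas and Matplotlib are available. To stay self-contained it reads a small in-memory CSV instead of the real file; the column names and values are made up for illustration and may not match vaccinations.csv. With the real file, pd.read_csv would be given its path instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real file (assumed column names, made-up values).
csv_text = """location,total_vaccinations,people_vaccinated,daily_vaccinations
A,100,80,10
B,200,150,25
C,300,260,20
D,400,310,45
E,500,420,40
"""
df = pd.read_csv(io.StringIO(csv_text))

# Pairwise Pearson correlation between the numeric (continuous) columns only.
corr = df.select_dtypes(include="number").corr(method="pearson")

# Heatmap of the correlation matrix, with the colour scale fixed to [-1, 1].
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="Pearson r")
fig.tight_layout()
# Call fig.savefig(...) or plt.show() here to view the figure.

print(corr.round(3))
```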

Bivariate analysis: Categorical vs. categorical

• Plot a stacked bar chart for the given dataset. Interpret it.
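
The table behind a stacked bar chart is a cross-tabulation of the two categorical variables. The sketch below builds it with pandas; the column names (continent, coverage_band) and values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical data; not taken from vaccinations.csv.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Oceania"],
    "coverage_band": ["low", "high", "high", "medium", "high", "medium"],
})

# Counts of each coverage band within each continent.
counts = pd.crosstab(df["continent"], df["coverage_band"])

# Row-wise proportions, useful when group sizes differ.
shares = pd.crosstab(df["continent"], df["coverage_band"], normalize="index")

print(counts)
print(shares.round(2))

# With Matplotlib installed, counts.plot(kind="bar", stacked=True)
# draws the stacked bar chart from this table.
```

Stacking the proportions rather than the raw counts makes groups of different sizes directly comparable.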

Bivariate analysis: Categorical vs. continuous

• Plot a swarmplot on top of a boxplot for the given dataset. Interpret it.

• Explain important principles of data visualisation that apply to the given dataset.

• Apply effective data storytelling to the given dataset.

Knaflic suggests a process for storytelling with data:

a) Understand the context
b) Choose an appropriate display
c) Eliminate clutter
d) Draw attention where you want it
e) Think like a designer
f) Tell a story

• Use Microsoft Power BI to generate a data visualisation dashboard for the given dataset.

• Visualise the data in the given dataset in the form of a pie chart or a bar graph.

• The visualisation should achieve the following:

a) Clearly indicate how the values relate to one another.
b) Represent the quantities accurately.
c) Make it easy to compare the quantities.
d) Make it easy to see the ranked order of values.
e) Make it obvious how people should use the information (what they should use it to accomplish) and encourage them to do this.

• Explain which of the following Gestalt principles can help inform the visualisation efforts: proximity, similarity, enclosure, closure, continuity, connection.

• Use Power BI Desktop (or Power BI online) to create an interactive dashboard (for the given dataset) that resembles the screenshot below.
