Correlation Analysis

Correlation analysis is a statistical technique that quantifies the relationship between two variables in a model. It is used to understand the nature of that relationship, measuring both the direction and the strength of the association between the variables.

Correlation analysis lets data scientists see how one variable relates to the other variables in a model. Because it exposes relationships between variables early on, it is considered one of the most important techniques in Exploratory Data Analysis (EDA).

It is a common misunderstanding that correlation analysis determines cause and effect. Correlation describes only the association between variables, not causation. It is also more suitable for continuous variables than for categorical variables.

A correlation between two variables can be positive, negative or neutral.

Positive Correlation - As the value of one variable increases, the value of the other variable tends to increase too.

Negative Correlation - As the value of one variable increases, the value of the other variable tends to decrease.

Neutral Correlation - An increase or decrease in the value of one variable shows no consistent association with the value of the other variable.

The strength of association between variables is indicated by the magnitude of the correlation coefficient. The coefficient ranges from -1 to +1: the closer it is to +1, the stronger the positive correlation, and the closer it is to -1, the stronger the negative correlation. A value near 0 indicates little or no linear association.
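To make the -1 to +1 scale concrete, here is a minimal sketch in Python that computes the Pearson correlation coefficient by hand. The function name pearson_r and the small datasets are made up for illustration; they are not from any particular library or study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))   # co-variation of x and y
    sxx = sum(a * a for a in dx)               # variation of x
    syy = sum(b * b for b in dy)               # variation of y
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only to illustrate the three cases.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]   # tends to rise with hours
errors = [9, 8, 6, 5, 2]       # tends to fall with hours
shoe_size = [7, 9, 6, 9, 7]    # no linear pattern against hours

r_pos = pearson_r(hours, score)       # close to +1
r_neg = pearson_r(hours, errors)      # close to -1
r_none = pearson_r(hours, shoe_size)  # essentially 0

print(r_pos, r_neg, r_none)
```

The sign of the result tells you the direction of the association, and the absolute value tells you how strong it is.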

Correlation analysis is very helpful in:

 - estimating the unknown value of one variable from the known value of another, through regression.

 - identifying functionally similar variables.

 - identifying dependencies among the variables.
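The first use above, estimating one variable from another through regression, can be sketched as follows. This is a simple least-squares line whose slope is derived from the correlation coefficient. The function name fit_line and the data are hypothetical, used only for illustration.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    # The slope equals r scaled by the ratio of the spreads (same as sxy / sxx).
    slope = r * math.sqrt(syy / sxx)
    intercept = mean_y - slope * mean_x
    return slope, intercept, r

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]

slope, intercept, r = fit_line(hours, score)

# Estimate the unknown score for a known value of 6 hours.
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```

The stronger the correlation, the more trustworthy such an estimate is. With a coefficient near 0, the fitted line says very little about the unknown value.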

Factors to be considered:

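 - Correlation analysis should not be used when the data consists of repeated measurements taken at the same or varied time intervals, because such observations are not independent.

 - Outliers should be checked for. An outlier is essentially an infrequently occurring value in the dataset, and even a single one can distort the correlation coefficient.

 - It should not be performed if there is a non-linear relationship between the variables, since the commonly used Pearson coefficient measures only linear association.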

 - The sample size should be appropriate.
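Two of the factors above, outliers and non-linear relationships, are easy to see with a small sketch. The helper pearson_r and all the numbers below are invented for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 1. Outlier: five points with no linear pattern ...
xs = [1, 2, 3, 4, 5]
ys = [7, 9, 6, 9, 7]
r_clean = pearson_r(xs, ys)

# ... plus a single extreme point far away from the rest.
r_with_outlier = pearson_r(xs + [20], ys + [30])

# 2. Non-linear relationship: y depends on x completely (y = x squared),
# yet the linear correlation coefficient does not pick it up.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x * x for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

print(r_clean, r_with_outlier, r_curve)
```

In this sketch one added point moves the coefficient from about 0 to above 0.9, while the perfectly dependent curved data still scores about 0. This is why it helps to plot the data before trusting the coefficient.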

With the wide availability of open source platforms, implementing Correlation Analysis has become much easier. corrgram, PerformanceAnalytics, Hmisc, psych and polycor are some of the R packages used to implement different types of correlation in R.

I hope this has given you a brief description of Correlation Analysis and an idea of how it is used. In future articles, I'll briefly discuss other analytical techniques.

I always welcome and value your suggestions, so please feel free to reach out to me. I'm reachable through the following links.

Email - kgfahath@gmail.com
