Sunday, 30 August 2015

Data Preparation using SAS

Before any data analysis begins, one task is critical to the success of the project: data preparation. You may have heard that data production has been expanding at an astonishing pace in recent years. Experts now point to a 4300% increase in annual data generation by 2020. This can be attributed to the switch from analog to digital technologies and to the rapid increase in data generated by individuals and corporations alike. Most of the data generated in the last few years is unstructured.

In this context, it is important to turn unstructured data into a structured dataset before attempting any meaningful analysis.
“Data preparation means manipulation of data into a form suitable for further analysis and processing”



“Data preparation techniques consist of Cleaning, Integration, Selection and Transformation”
We will discuss some of these data preparation techniques using SAS. An INFORMAT tells SAS how to read data that contains special characters such as dollar signs, commas, or date separators. A FORMAT tells SAS how to display the stored values with those characters.

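/* Assumes the libref DP has already been assigned with a LIBNAME statement. */
data DP.Practice;
length City $10;
/* Informats tell list input how to read the currency and date fields. */
informat Salary dollar6. DOJ ddmmyy10. Profit dollar7.2;
input City $ ID $ Age Salary DOJ Profit;
/* Formats control how the stored numeric values are displayed. */
format Salary dollar6. DOJ ddmmyy10. Profit dollar7.2;
label DOJ = "Date of Joining";
rename Salary = Salary_of_Employee;
/* A period marks a missing Age so the remaining fields stay aligned. */
datalines;
Bangalore T101 24 $2,000 12/12/2010 $300.50
Pune T102 29 $3,000 11/10/2006 $400.50
Hyderabad T103 . $5,000 12/10/2008 $500.70
Delhi T104 . $6,000 12/12/2009 $450.00
Pune T105 . $7,000 12/12/2009 $450.00
;
run;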


The code above uses both the INFORMAT and FORMAT statements to read and display data that contains special characters. The INFORMAT statement reads Salary as a numeric variable from text such as $5,000, which is 6 characters wide including the dollar sign and comma. The FORMAT statement displays the stored number in the same form in the output. The RENAME and LABEL statements modify the variable metadata so that the dataset is easier to understand.
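To see the difference between reading and displaying a value, here is a minimal sketch that uses the INPUT function with an informat and the PUT statement with a format. The variable names raw_salary, raw_doj, salary and doj are made up for illustration and are not part of the dataset above.

```sas
/* Illustrative only: the variable names here are hypothetical. */
data _null_;
  raw_salary = "$5,000";
  raw_doj    = "12/12/2010";

  /* Informats convert the text into plain numeric values. */
  salary = input(raw_salary, dollar6.);
  doj    = input(raw_doj, ddmmyy10.);

  /* Without a format, the log shows the stored numbers;        */
  /* a SAS date is stored as the number of days since 01JAN1960. */
  put salary= doj=;

  /* With formats, the same numbers are written in readable form. */
  put salary= dollar6. doj= ddmmyy10.;
run;
```

The first PUT statement should write the raw numeric values to the log, and the second should write them back as $5,000 and 12/12/2010. The stored values do not change; only their presentation does.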
Next, we will apply a transformation that prepares a dataset for more advanced analytical techniques. Our dataset holds various attributes of customers who have or have not subscribed to an edition. It contains a categorical variable, status, whose value is either “Subscribed” or “Not Subscribed”. We can recode this categorical variable as a dichotomous (0/1) variable in order to run a logistic regression on the dataset.

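/* Assumes the libref DP has been assigned and DP.media contains status. */
data media01;
set DP.media;
/* status keeps the length it has in DP.media; upcase() makes the comparison ignore case. */
if upcase(status) = "SUBSCRIBED" then status = "0";
else status = "1";
run;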

The code above uses a simple IF-THEN/ELSE statement to transform the media dataset. Recoding the categorical variable as a dichotomous variable lets us apply the analytical techniques we want to run on the dataset. Once the transformation is done, the dataset is ready for the next stage, data analysis.
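An alternative to overwriting status is to keep the original text and add a separate numeric indicator, which can be easier to check. The sketch below assumes a small made-up dataset, media_sample, standing in for DP.media, and a hypothetical variable name, not_subscribed. It keeps the same coding as above: 0 for subscribed, 1 otherwise.

```sas
/* Hypothetical sample data standing in for DP.media. */
data media_sample;
  input status $15.;
  datalines;
Subscribed
Not Subscribed
subscribed
;
run;

data media_flag;
  set media_sample;
  /* In SAS a true comparison evaluates to 1 and a false one to 0, */
  /* so this creates a numeric 0/1 variable in a single assignment. */
  not_subscribed = (upcase(strip(status)) ne "SUBSCRIBED");
run;

/* Cross-check the recode against the original text values. */
proc freq data=media_flag;
  tables status * not_subscribed / list missing;
run;
```

Because the original status column is kept, the PROC FREQ listing makes it easy to spot values that were spelled differently and landed in the wrong group.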

The more thoroughly you “torture” your data during data preparation, the better the chances of a successful data analysis.
