Big Data Analytics - Introduction to R
This section introduces the R programming language. R can be downloaded from the CRAN website. For Windows users, it is useful to install Rtools and the RStudio IDE.
The general concept behind R is to serve as an interface to other software developed in compiled languages such as C, C++, and Fortran and to give the user an interactive tool to analyze data.
Navigate to the folder bda/part2/R_introduction of the book zip file and open the R_introduction.Rproj file. This will open an RStudio session. Then open the 01_vectors.R file. Run the script line by line and follow the comments in the code. Another useful way to learn is to type the code yourself, which will help you get used to R syntax. In R, comments are written with the # symbol.
To display the results of running R code in the book, the output R returns after the code is evaluated is shown as comments. This way, you can copy and paste the code from the book and try sections of it directly in R.
# Create a vector of numbers
numbers = c(1, 2, 3, 4, 5)
print(numbers)
# [1] 1 2 3 4 5

# Create a vector of letters
ltrs = c("a", "b", "c", "d", "e")
print(ltrs)
# [1] "a" "b" "c" "d" "e"

# Concatenate both
mixed_vec = c(numbers, ltrs)
print(mixed_vec)
# [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"
Let's analyze what happened in the previous code. We can see it is possible to create vectors with numbers and with letters. We did not need to tell R beforehand what data type we wanted. Finally, we were able to create a vector with both numbers and letters. The vector mixed_vec has coerced the numbers to character; we can see this because the values are printed inside quotes.
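The coercion described above follows a fixed order: when types are mixed inside c(), R converts every element to the most general type present (logical, then numeric, then character). The sketch below illustrates this; the variable names lgl_num, num_chr, chr_digits and back_to_num are illustrative and are not part of the book's scripts.

```r
# Mixing logical and numeric values: TRUE and FALSE become 1 and 0
lgl_num = c(TRUE, FALSE, 2.5)
print(lgl_num)
# [1] 1.0 0.0 2.5

# Mixing numeric and character values: everything becomes character
num_chr = c(1, "a")
print(num_chr)
# [1] "1" "a"

# Character digits do not behave like numbers until converted back
chr_digits = c("1", "2", "3")
back_to_num = as.numeric(chr_digits)
print(back_to_num + 1)
# [1] 2 3 4
```

Converting back with as.numeric only makes sense when every element really holds a number; text such as "a" would become NA with a warning.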
The following code shows the data type of different vectors as returned by the function class. It is common to use the class function to "interrogate" an object, asking it what its class is.
### Evaluate the data types using class

### One dimensional objects
# Integer vector
num = 1:10
class(num)
# [1] "integer"

# Numeric vector, it has a float, 10.5
num = c(1:10, 10.5)
class(num)
# [1] "numeric"

# Character vector
ltrs = letters[1:10]
class(ltrs)
# [1] "character"

# Factor vector
fac = as.factor(ltrs)
class(fac)
# [1] "factor"
R also supports two-dimensional objects. The following code shows examples of the two most popular data structures used in R: the matrix and the data.frame.
# Matrix
M = matrix(1:12, ncol = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

lM = matrix(letters[1:12], ncol = 4)
#      [,1] [,2] [,3] [,4]
# [1,] "a"  "d"  "g"  "j"
# [2,] "b"  "e"  "h"  "k"
# [3,] "c"  "f"  "i"  "l"

# Coerces the numbers to character
# cbind concatenates two matrices (or vectors) in one matrix
cbind(M, lM)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] "1"  "4"  "7"  "10" "a"  "d"  "g"  "j"
# [2,] "2"  "5"  "8"  "11" "b"  "e"  "h"  "k"
# [3,] "3"  "6"  "9"  "12" "c"  "f"  "i"  "l"

class(M)
# [1] "matrix"    (since R 4.0.0: "matrix" "array")
class(lM)
# [1] "matrix"    (since R 4.0.0: "matrix" "array")

# data.frame
# One of the main objects of R, handles different data types in the same object.
# It is possible to have numeric, character and factor vectors in the same data.frame
df = data.frame(n = 1:5, l = letters[1:5])
df
#   n l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
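To make the difference between the two structures concrete, the sketch below checks the type stored by each one. The names M2 and df_types are illustrative, and stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
# A matrix stores a single data type, so mixing types coerces every cell
M2 = cbind(1:3, letters[1:3])
typeof(M2)
# [1] "character"
dim(M2)
# [1] 3 2

# A data.frame keeps one type per column
df_types = data.frame(n = 1:3, l = letters[1:3], stringsAsFactors = FALSE)
class(df_types$n)
# [1] "integer"
class(df_types$l)
# [1] "character"
```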
As demonstrated in the previous example, it is possible to use different data types in the same object. In general, this is how data is presented in databases and APIs: part of the data is text (character vectors) and the rest is numeric. It is the analyst's job to determine which statistical data type to assign and then use the correct R data type for it. In statistics we normally consider variables to be of the following types −
Numeric
Nominal or categorical
Ordinal
In R, a vector can be of the following classes −
Numeric - Integer
Factor
Ordered Factor
R provides a data type for each statistical type of variable. The ordered factor is rarely used, however; it can be created with the function factor (with ordered = TRUE) or with the function ordered.
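As a minimal sketch of how an ordinal variable maps to an ordered factor, consider the example below. The data in sizes and the names f, sizes_ord and sizes_ord2 are hypothetical and only serve the illustration.

```r
# Hypothetical answers with a natural order
sizes = c("medium", "small", "large", "small")

# A plain factor sorts its levels alphabetically and carries no order
f = factor(sizes)
levels(f)
# [1] "large"  "medium" "small"

# An ordered factor takes the levels in the order we supply
sizes_ord = factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
class(sizes_ord)
# [1] "ordered" "factor"

# Comparisons are now meaningful
sizes_ord < "large"
# [1]  TRUE  TRUE FALSE  TRUE

# ordered() is a shortcut for factor(..., ordered = TRUE)
sizes_ord2 = ordered(sizes, levels = c("small", "medium", "large"))
identical(sizes_ord, sizes_ord2)
# [1] TRUE
```

Supplying levels explicitly matters: without it the order would be alphabetical, which rarely matches the intended ranking.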
The following section covers indexing. This is a very common operation that deals with selecting sections of an object and transforming them.
# Let's create a data.frame
# stringsAsFactors = TRUE was the default before R 4.0.0; it is set explicitly
# here so the output below is the same in newer versions
df = data.frame(numbers = 1:26, letters, stringsAsFactors = TRUE)
head(df)
#   numbers letters
# 1       1       a
# 2       2       b
# 3       3       c
# 4       4       d
# 5       5       e
# 6       6       f

# str gives the structure of a data.frame, it's a good summary to inspect an object
str(df)
# 'data.frame': 26 obs. of 2 variables:
#  $ numbers: int 1 2 3 4 5 6 7 8 9 10 ...
#  $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# The latter shows the letters character vector was coerced to a factor.
# This is explained by the stringsAsFactors = TRUE argument in data.frame
# read ?data.frame for more information

class(df)
# [1] "data.frame"

### Indexing
# Get the first row
df[1, ]
#   numbers letters
# 1       1       a

# Used for programming normally - returns the output as a list
df[1, , drop = TRUE]
# $numbers
# [1] 1
#
# $letters
# [1] a
# Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

# Get several rows of the data.frame
df[5:7, ]
#   numbers letters
# 5       5       e
# 6       6       f
# 7       7       g

### Add one column that mixes the numeric column with the factor column
df$mixed = paste(df$numbers, df$letters, sep = "")
str(df)
# 'data.frame': 26 obs. of 3 variables:
#  $ numbers: int 1 2 3 4 5 6 7 8 9 10 ...
#  $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ mixed  : chr "1a" "2b" "3c" "4d" ...
### Get columns
# Get the first column
df[, 1]
# It returns a one dimensional vector with that column

# Get two columns
df2 = df[, 1:2]
head(df2)
#   numbers letters
# 1       1       a
# 2       2       b
# 3       3       c
# 4       4       d
# 5       5       e
# 6       6       f

# Get the first and third columns
df3 = df[, c(1, 3)]
df3[1:3, ]
#   numbers mixed
# 1       1    1a
# 2       2    2b
# 3       3    3c

### Index columns from their names
names(df)
# [1] "numbers" "letters" "mixed"
# This is the best practice in programming, as many times indices change, but variable names don't
# We create a variable with the names we want to subset
keep_vars = c("numbers", "mixed")
df4 = df[, keep_vars]
head(df4)
#   numbers mixed
# 1       1    1a
# 2       2    2b
# 3       3    3c
# 4       4    4d
# 5       5    5e
# 6       6    6f

### subset rows and columns
# Keep the first five rows
df5 = df[1:5, keep_vars]
df5
#   numbers mixed
# 1       1    1a
# 2       2    2b
# 3       3    3c
# 4       4    4d
# 5       5    5e

# subset rows using a logical condition
df6 = df[df$numbers < 10, keep_vars]
df6
#   numbers mixed
# 1       1    1a
# 2       2    2b
# 3       3    3c
# 4       4    4d
# 5       5    5e
# 6       6    6f
# 7       7    7g
# 8       8    8h
# 9       9    9i
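A few related indexing idioms are worth knowing alongside the ones above. The sketch below is self-contained and uses a small illustrative data.frame named df_idx, which is not part of the book's scripts.

```r
# A small data.frame so the block runs on its own
df_idx = data.frame(numbers = 1:5, letters = letters[1:5],
                    stringsAsFactors = FALSE)

# Selecting one column with [, j] drops to a vector ...
class(df_idx[, "numbers"])
# [1] "integer"

# ... unless drop = FALSE is passed, which keeps the data.frame
class(df_idx[, "numbers", drop = FALSE])
# [1] "data.frame"

# $ and [[ ]] also extract a single column as a vector
identical(df_idx$numbers, df_idx[["numbers"]])
# [1] TRUE

# Negative indices exclude rows or columns
names(df_idx[, -1, drop = FALSE])
# [1] "letters"
nrow(df_idx[-(1:2), ])
# [1] 3
```

Passing drop = FALSE is a common safeguard in functions, since code that expects a data.frame can otherwise fail when only one column happens to be selected.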