- Big Data Analytics - Data Scientist
- Big Data Analytics - Data Analyst
- Key Stakeholders
- Core Deliverables
- Big Data Analytics - Methodology
- Big Data Analytics - Data Life Cycle
- Big Data Analytics - Overview
- Big Data Analytics - Home
Big Data Analytics Project
- Data Visualization
- Big Data Analytics - Data Exploration
- Big Data Analytics - Summarizing
- Big Data Analytics - Cleansing data
- Big Data Analytics - Data Collection
- Data Analytics - Problem Definition
Big Data Analytics Methods
- Data Analytics - Statistical Methods
- Big Data Analytics - Data Tools
- Big Data Analytics - Charts & Graphs
- Data Analytics - Introduction to SQL
- Big Data Analytics - Introduction to R
Advanced Methods
- Big Data Analytics - Online Learning
- Big Data Analytics - Text Analytics
- Big Data Analytics - Time Series
- Logistic Regression
- Big Data Analytics - Decision Trees
- Association Rules
- K-Means Clustering
- Naive Bayes Classifier
- Machine Learning for Data Analysis
Big Data Analytics Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Big Data Analytics - Data Exploration
Exploratory data analysis is a concept developed by John Tuckey (1977) that consists on a new perspective of statistics. Tuckey’s idea was that in traditional statistics, the data was not being explored graphically, is was just being used to test hypotheses. The first attempt to develop a tool was done in Stanford, the project was called
. The tool was able to visuapze data in nine dimensions, therefore it was able to provide a multivariate perspective of the data.In recent days, exploratory data analysis is a must and has been included in the big data analytics pfe cycle. The abipty to find insight and be able to communicate it effectively in an organization is fueled with strong EDA capabipties.
Based on Tuckey’s ideas, Bell Labs developed the S programming language in order to provide an interactive interface for doing statistics. The idea of S was to provide extensive graphical capabipties with an easy-to-use language. In today’s world, in the context of Big Data, R that is based on the S programming language is the most popular software for analytics.
The following program demonstrates the use of exploratory data analysis.
The following is an example of exploratory data analysis. This code is also available in part1/eda/exploratory_data_analysis.R file.
pbrary(nycfpghts13) pbrary(ggplot2) pbrary(data.table) pbrary(reshape2) # Using the code from the previous section # This computes the mean arrival and departure delays by carrier. DT <- as.data.table(fpghts) mean2 = DT[, pst(mean_departure_delay = mean(dep_delay, na.rm = TRUE), mean_arrival_delay = mean(arr_delay, na.rm = TRUE)), by = carrier] # In order to plot data in R usign ggplot, it is normally needed to reshape the data # We want to have the data in long format for plotting with ggplot dt = melt(mean2, id.vars = ’carrier’) # Take a look at the first rows print(head(dt)) # Take a look at the help for ?geom_point and geom_pne to find similar examples # Here we take the carrier code as the x axis # the value from the dt data.table goes in the y axis # The variable column represents the color p = ggplot(dt, aes(x = carrier, y = value, color = variable, group = variable)) + geom_point() + # Plots points geom_pne() + # Plots pnes theme_bw() + # Uses a white background labs(pst(title = Mean arrival and departure delay by carrier , x = Carrier , y = Mean delay )) print(p) # Save the plot to disk ggsave( mean_delay_by_carrier.png , p, width = 10.4, height = 5.07)
The code should produce an image such as the following −
Advertisements