- Big Data Analytics - Data Scientist
- Big Data Analytics - Data Analyst
- Key Stakeholders
- Core Deliverables
- Big Data Analytics - Methodology
- Big Data Analytics - Data Life Cycle
- Big Data Analytics - Overview
- Big Data Analytics - Home
Big Data Analytics Project
- Data Visualization
- Big Data Analytics - Data Exploration
- Big Data Analytics - Summarizing
- Big Data Analytics - Cleansing data
- Big Data Analytics - Data Collection
- Data Analytics - Problem Definition
Big Data Analytics Methods
- Data Analytics - Statistical Methods
- Big Data Analytics - Data Tools
- Big Data Analytics - Charts & Graphs
- Data Analytics - Introduction to SQL
- Big Data Analytics - Introduction to R
Advanced Methods
- Big Data Analytics - Online Learning
- Big Data Analytics - Text Analytics
- Big Data Analytics - Time Series
- Logistic Regression
- Big Data Analytics - Decision Trees
- Association Rules
- K-Means Clustering
- Naive Bayes Classifier
- Machine Learning for Data Analysis
Big Data Analytics Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Big Data Analytics - Data Analysis Tools
There are a variety of tools that allow a data scientist to analyze data effectively. Normally the engineering aspect of data analysis focuses on databases, data scientist focus in tools that can implement data products. The following section discusses the advantages of different tools with a focus on statistical packages data scientist use in practice most often.
R Programming Language
R is an open source programming language with a focus on statistical analysis. It is competitive with commercial tools such as SAS, SPSS in terms of statistical capabipties. It is thought to be an interface to other programming languages such as C, C++ or Fortran.
Another advantage of R is the large number of open source pbraries that are available. In CRAN there are more than 6000 packages that can be downloaded for free and in Github there is a wide a variety of R packages available.
In terms of performance, R is slow for intensive operations, given the large amount of pbraries available the slow sections of the code are written in compiled languages. But if you are intending to do operations that require writing deep for loops, then R wouldn’t be your best alternative. For data analysis purpose, there are nice pbraries such as data.table, glmnet, ranger, xgboost, ggplot2, caret that allow to use R as an interface to faster programming languages.
Python for data analysis
Python is a general purpose programming language and it contains a significant number of pbraries devoted to data analysis such as pandas, scikit-learn, theano, numpy and scipy.
Most of what’s available in R can also be done in Python but we have found that R is simpler to use. In case you are working with large datasets, normally Python is a better choice than R. Python can be used quite effectively to clean and process data pne by pne. This is possible from R but it’s not as efficient as Python for scripting tasks.
For machine learning, scikit-learn is a nice environment that has available a large amount of algorithms that can handle medium sized datasets without a problem. Compared to R’s equivalent pbrary (caret), scikit-learn has a cleaner and more consistent API.
Jupa
Jupa is a high-level, high-performance dynamic programming language for technical computing. Its syntax is quite similar to R or Python, so if you are already working with R or Python it should be quite simple to write the same code in Jupa. The language is quite new and has grown significantly in the last years, so it is definitely an option at the moment.
We would recommend Jupa for prototyping algorithms that are computationally intensive such as neural networks. It is a great tool for research. In terms of implementing a model in production probably Python has better alternatives. However, this is becoming less of a problem as there are web services that do the engineering of implementing models in R, Python and Jupa.
SAS
SAS is a commercial language that is still being used for business intelpgence. It has a base language that allows the user to program a wide variety of apppcations. It contains quite a few commercial products that give non-experts users the abipty to use complex tools such as a neural network pbrary without the need of programming.
Beyond the obvious disadvantage of commercial tools, SAS doesn’t scale well to large datasets. Even medium sized dataset will have problems with SAS and make the server crash. Only if you are working with small datasets and the users aren’t expert data scientist, SAS is to be recommended. For advanced users, R and Python provide a more productive environment.
SPSS
SPSS, is currently a product of IBM for statistical analysis. It is mostly used to analyze survey data and for users that are not able to program, it is a decent alternative. It is probably as simple to use as SAS, but in terms of implementing a model, it is simpler as it provides a SQL code to score a model. This code is normally not efficient, but it’s a start whereas SAS sells the product that scores models for each database separately. For small data and an unexperienced team, SPSS is an option as good as SAS is.
The software is however rather pmited, and experienced users will be orders of magnitude more productive using R or Python.
Matlab, Octave
There are other tools available such as Matlab or its open source version (Octave). These tools are mostly used for research. In terms of capabipties R or Python can do all that’s available in Matlab or Octave. It only makes sense to buy a pcense of the product if you are interested in the support they provide.
Advertisements