- Python Data Science - Matplotlib
- Python Data Science - SciPy
- Python Data Science - Numpy
- Python Data Science - Pandas
- Python Data Science - Environment Setup
- Python Data Science - Getting Started
- Python Data Science - Home
Python Data Processing
- Python Stemming and Lemmatization
- Python word tokenization
- Python Processing Unstructured Data
- Python Reading HTML Pages
- Python Data Aggregation
- Python Data Wrangling
- Python Date and Time
- Python NoSQL Databases
- Python Relational databases
- Python Processing XLS Data
- Python Processing JSON Data
- Python Processing CSV Data
- Python Data cleansing
- Python Data Operations
Python Data Visualization
- Python Graph Data
- Python Geographical Data
- Python Time Series
- Python 3D Charts
- Python Bubble Charts
- Python Scatter Plots
- Python Heat Maps
- Python Box Plots
- Python Chart Styling
- Python Chart Properties
Statistical Data Analysis
- Python Linear Regression
- Python Chi-square Test
- Python Correlation
- Python P-Value
- Python Bernoulli Distribution
- Python Poisson Distribution
- Python Binomial Distribution
- Python Normal Distribution
- Python Measuring Variance
- Python Measuring Central Tendency
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Python - Processing XLS Data
Microsoft Excel is a very widely used spread sheet program. Its user friendpness and appeapng features makes it a very frequently used tool in Data Science. The Panadas pbrary provides features using which we can read the Excel file in full as well as in parts for only a selected group of Data. We can also read an Excel file with multiple sheets in it. We use the read_excel function to read the data from it.
Input as Excel File
We Create an excel file with multiple sheets in the windows OS. The Data in the different sheets is as shown below.
You can create this file using the Excel Program in windows OS. Save the file as input.xlsx.
# Data in Sheet1 id,name,salary,start_date,dept 1,Rick,623.3,2012-01-01,IT 2,Dan,515.2,2013-09-23,Operations 3,Tusar,611,2014-11-15,IT 4,Ryan,729,2014-05-11,HR 5,Gary,843.25,2015-03-27,Finance 6,Rasmi,578,2013-05-21,IT 7,Pranab,632.8,2013-07-30,Operations 8,Guru,722.5,2014-06-17,Finance # Data in Sheet2 id name zipcode 1 Rick 301224 2 Dan 341255 3 Tusar 297704 4 Ryan 216650 5 Gary 438700 6 Rasmi 665100 7 Pranab 341211 8 Guru 347480
Reading an Excel File
The read_excel function of the pandas pbrary is used read the content of an Excel file into the python environment as a pandas DataFrame. The function can read the files from the OS by using proper path to the file. By default, the function will read Sheet1.
import pandas as pd data = pd.read_excel( path/input.xlsx ) print (data)
When we execute the above code, it produces the following result. Please note how an additional column starting with zero as a index has been created by the function.
id name salary start_date dept 0 1 Rick 623.30 2012-01-01 IT 1 2 Dan 515.20 2013-09-23 Operations 2 3 Tusar 611.00 2014-11-15 IT 3 4 Ryan 729.00 2014-05-11 HR 4 5 Gary 843.25 2015-03-27 Finance 5 6 Rasmi 578.00 2013-05-21 IT 6 7 Pranab 632.80 2013-07-30 Operations 7 8 Guru 722.50 2014-06-17 Finance
Reading Specific Columns and Rows
Similar to what we have already seen in the previous chapter to read the CSV file, the read_excel function of the pandas pbrary can also be used to read some specific columns and specific rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.
import pandas as pd data = pd.read_excel( path/input.xlsx ) # Use the multi-axes indexing funtion print (data.loc[[1,3,5],[ salary , name ]])
When we execute the above code, it produces the following result.
salary name 1 515.2 Dan 3 729.0 Ryan 5 578.0 Rasmi
Reading Multiple Excel Sheets
Multiple sheets with different Data formats can also be read by using read_excel function with help of a wrapper class named ExcelFile. It will read the multiple sheets into memory only once. In the below example we read sheet1 and sheet2 into two data frames and print them out inspanidually.
import pandas as pd with pd.ExcelFile( C:/Users/Rasmi/Documents/pydatasci/input.xlsx ) as xls: df1 = pd.read_excel(xls, Sheet1 ) df2 = pd.read_excel(xls, Sheet2 ) print("****Result Sheet 1****") print (df1[0:5][ salary ]) print("") print("***Result Sheet 2****") print (df2[0:5][ zipcode ])
When we execute the above code, it produces the following result.
****Result Sheet 1**** 0 623.30 1 515.20 2 611.00 3 729.00 4 843.25 Name: salary, dtype: float64 ***Result Sheet 2**** 0 301224 1 341255 2 297704 3 216650 4 438700 Name: zipcode, dtype: int64Advertisements