Analyzing Time Series Data
  • Date: 2024-12-22

AI with Python – Analyzing Time Series Data



Predicting the next item in a given input sequence is another important concept in machine learning. This chapter explains in detail how to analyze time series data.

Introduction

Time series data is data recorded at a series of particular time intervals. To build sequence prediction models in machine learning, we have to deal with sequential data and time. Time series data is one form of sequential data, and the ordering of the data is an important feature of any sequential data.

Basic Concept of Sequence Analysis or Time Series Analysis

Sequence analysis, or time series analysis, predicts the next item in a given input sequence based on the items observed so far. The prediction can be anything that may come next: a symbol, a number, the next day's weather, the next term in speech, and so on. Sequence analysis can be very handy in applications such as stock market analysis, weather forecasting, and product recommendations.

Example

Consider the following example to understand sequence prediction. Here A,B,C,D are the given values and you have to predict the value E using a Sequence Prediction Model.

sequence prediction model
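
As a rough illustration of the idea, the sketch below predicts the next symbol by counting which symbol most often followed the current one in the past. The helper names fit_transitions and predict_next and the training sequence are made up for this illustration; this is only one simple way to approach sequence prediction, not the specific model pictured above −

```python
from collections import Counter, defaultdict

def fit_transitions(sequence):
    """Count how often each symbol is followed by each other symbol."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, last_symbol):
    """Return the most frequent successor of last_symbol, or None if it was never seen."""
    if last_symbol not in counts:
        return None
    return counts[last_symbol].most_common(1)[0][0]

# Hypothetical training data: the pattern A..E observed twice.
history = list("ABCDEABCDE")
model = fit_transitions(history)

# Given A, B, C, D, predict what follows the last observed symbol D.
prediction = predict_next(model, "D")
print(prediction)
```

A counting model like this only looks at the most recent symbol, which is the same simplifying assumption that the Markov models later in this chapter rely on.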

Installing Useful Packages

For time series data analysis using Python, we need to install the following packages −

Pandas

Pandas is an open source, BSD-licensed library which provides high-performance, easy-to-use data structures and data analysis tools for Python. You can install Pandas with the help of the following command −

pip install pandas

If you are using Anaconda and want to install by using the conda package manager, then you can use the following command −

conda install -c anaconda pandas

hmmlearn

It is an open source, BSD-licensed library which consists of simple algorithms and models to learn Hidden Markov Models (HMM) in Python. You can install it with the help of the following command −

pip install hmmlearn

If you are using Anaconda and want to install by using the conda package manager, then you can use the following command −

conda install -c omnia hmmlearn

PyStruct

It is a structured learning and prediction library. Learning algorithms implemented in PyStruct have names such as conditional random fields (CRF), Maximum-Margin Markov Random Networks (M3N) or structural support vector machines. You can install it with the help of the following command −

pip install pystruct

CVXOPT

It is a free software package for convex optimization based on the Python programming language. You can install it with the help of the following command −

pip install cvxopt

If you are using Anaconda and want to install by using the conda package manager, then you can use the following command −

conda install -c anaconda cvxopt

Pandas: Handling, Slicing and Extracting Statistics from Time Series Data

Pandas is a very useful tool if you have to work with time series data. With the help of Pandas, you can perform the following −

    Create a range of dates by using the pd.date_range function

    Index data with dates by using pd.Series

    Perform re-sampling by using the ts.resample method

    Change the frequency
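
The four operations listed above can be sketched on a small synthetic series. The values are placeholder data, and the sketch assumes the start-anchored frequency aliases "MS", "YS" and "QS", because the names of the end-anchored aliases differ between pandas versions −

```python
import numpy as np
import pandas as pd

# Create a range of 24 month-start dates beginning in January 2000.
dates = pd.date_range("2000-01-01", periods=24, freq="MS")

# Index a Series with those dates (the values 0..23 are placeholder data).
ts = pd.Series(np.arange(24, dtype=float), index=dates)

# Re-sample to a lower frequency: one mean value per calendar year.
yearly = ts.resample("YS").mean()

# Change the frequency without aggregating: keep only quarter-start months.
quarterly = ts.asfreq("QS")

print(yearly)
print(quarterly)
```

Here resample groups the observations and needs an aggregation such as mean(), while asfreq simply picks the values that already sit on the new date grid.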

Example

The following example shows how to handle and slice time series data by using Pandas. Note that here we are using the Monthly Arctic Oscillation data, which can be downloaded from monthly.ao.index.b50.current.ascii and can be converted to text format for our use.

Handling time series data

For handling time series data, you will have to perform the following steps −

The first step involves importing the following packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, define a function which will read the data from the input file, as shown in the code given below −

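def read_data(input_file, index = 2):
   # index selects the column that holds the values; the default of 2
   # is an assumption that each row is laid out as (year, month, value)
   input_data = np.loadtxt(input_file, delimiter = None)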

Now, convert this data to time series. For this, create the range of dates of our time series. In this example, we keep one month as frequency of data. Our file is having the data which starts from January 1950.

   dates = pd.date_range("1950-01", periods = input_data.shape[0], freq = "M")

In this step, we create the time series data with the help of Pandas Series, as shown below −

   output = pd.Series(input_data[:, index], index = dates)
   return output

if __name__ == "__main__":

Enter the path of the input file as shown here −

   input_file = "/Users/admin/AO.txt"

Now, convert the column to timeseries format, as shown here −

   timeseries = read_data(input_file)

Finally, plot and visualize the data, using the commands shown −

   plt.figure()
   timeseries.plot()
   plt.show()

You will observe the plots as shown in the following images −

Test Series

Plots
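
The loading steps above can also be tried without downloading anything. The following sketch reads a few made-up rows from an in-memory text buffer; the function name read_series, the (year, month, value) column layout and the sample numbers are assumptions for illustration and are not real Arctic Oscillation values. It uses the month-start alias "MS" so that it does not depend on how a given pandas version names the month-end alias −

```python
import io

import numpy as np
import pandas as pd

def read_series(source, value_column=2, start="1950-01"):
    """Load whitespace-separated rows and return one column as a monthly Series."""
    # ndmin=2 keeps the array two-dimensional even if the input has a single row.
    data = np.loadtxt(source, ndmin=2)
    dates = pd.date_range(start, periods=data.shape[0], freq="MS")
    return pd.Series(data[:, value_column], index=dates)

# Made-up sample rows in the assumed layout: year, month, value.
sample = io.StringIO("1950 1 0.5\n1950 2 -1.0\n1950 3 0.25\n")
series = read_series(sample)
print(series)
```

Passing a file path instead of the StringIO buffer works the same way, since np.loadtxt accepts both.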

Slicing time series data

Slicing involves retrieving only some part of the time series data. As a part of the example, we are slicing the data only from 1980 to 1990. Observe the following code that performs this task −

timeseries["1980":"1990"].plot()
   <matplotlib.axes._subplots.AxesSubplot at 0xa0e4b00>

plt.show()

When you run the code for slicing the time series data, you can observe the following graph as shown in the image here −

Slicing Time Series Data
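
To see exactly which observations such a slice returns, here is a small sketch on a synthetic monthly series (placeholder values, month-start dates assumed). With a date index, year strings select whole years and both ends of the slice are included −

```python
import numpy as np
import pandas as pd

# Placeholder monthly series covering January 1950 to December 1999.
dates = pd.date_range("1950-01-01", periods=600, freq="MS")
ts = pd.Series(np.arange(600, dtype=float), index=dates)

# String labels select whole years, and both ends of the slice are inclusive.
decade = ts["1980":"1990"]
print(decade.index[0], decade.index[-1], len(decade))

# The same selection written with .loc, which makes label-based slicing explicit.
same = ts.loc["1980":"1990"]
```

Because the end label is inclusive, the slice covers eleven calendar years rather than ten.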

Extracting Statistics from Time Series Data

You will often need to extract some statistics from the data before you can draw any important conclusion. Mean, variance, correlation, maximum value, and minimum value are some of these statistics. You can use the following code to extract such statistics from a given time series −

Mean

You can use the mean() function to find the mean, as shown here −

timeseries.mean()

Then the output that you will observe for the example discussed is −

-0.11143128165238671

Maximum

You can use the max() function to find the maximum, as shown here −

timeseries.max()

Then the output that you will observe for the example discussed is −

3.4952999999999999

Minimum

You can use the min() function to find the minimum, as shown here −

timeseries.min()

Then the output that you will observe for the example discussed is −

-4.2656999999999998

Getting everything at once

If you want to calculate all statistics at a time, you can use the describe() function as shown here −

timeseries.describe()

Then the output that you will observe for the example discussed is −

count   817.000000
mean     -0.111431
std       1.003151
min      -4.265700
25%      -0.649430
50%      -0.042744
75%       0.475720
max       3.495300
dtype: float64
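
Variance and correlation are mentioned above but not demonstrated. A minimal sketch with two small made-up series (not the Arctic Oscillation data) could look like this −

```python
import pandas as pd

dates = pd.date_range("2000-01-01", periods=6, freq="MS")
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)
b = 2.0 * a + 1.0  # a second series that is an exact linear function of a

variance = a.var()       # sample variance (divides by n - 1)
correlation = a.corr(b)  # Pearson correlation between the two series

print(variance, correlation)
```

The correlation is computed on values that share the same date in both series, so the two series should be indexed consistently.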

Re-sampling

You can resample the data to a different time frequency. The two parameters for performing re-sampling are −

    Time period

    Method

Re-sampling with mean()

You can use the following code to resample the data to an annual frequency with the mean() method −

timeseries_mm = timeseries.resample("A").mean()
timeseries_mm.plot(style = "g--")
plt.show()

Then, you can observe the following graph as the output of re-sampling using mean() −

Re-sampling with Mean Method

Re-sampling with median()

You can use the following code to resample the data using the median() method −

timeseries_mm = timeseries.resample("A").median()
timeseries_mm.plot()
plt.show()

Then, you can observe the following graph as the output of re-sampling with median() −

Re-sampling with Median Method
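
The choice of method matters when a period contains outliers. The sketch below uses made-up monthly values with one extreme reading and compares the two annual aggregates. It assumes the year-start alias "YS" rather than the "A" alias used above, because the year-end alias has been renamed in newer pandas versions −

```python
import pandas as pd

dates = pd.date_range("2000-01-01", periods=24, freq="MS")
# Year 2000 holds eleven readings of 1.0 and one outlier of 100.0; year 2001 is all 2.0.
values = [1.0] * 11 + [100.0] + [2.0] * 12
ts = pd.Series(values, index=dates)

annual_mean = ts.resample("YS").mean()
annual_median = ts.resample("YS").median()

print(annual_mean)
print(annual_median)
```

The single outlier pulls the mean for 2000 far above the typical reading, while the median for that year stays at the typical value.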

Rolling Mean

You can use the following code to calculate the rolling (moving) mean −

timeseries.rolling(window = 12, center = False).mean().plot(style = "-g")
plt.show()

Then, you can observe the following graph as the output of the rolling (moving) mean −

Rolling Mean
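
To make the window behaviour concrete, here is a small sketch with a window of 3 on made-up values. The first window - 1 results are NaN because a full window is not yet available; the min_periods argument shown in the second call is one way to relax that −

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Each value is the mean of the current observation and the two before it.
rolling = ts.rolling(window=3, center=False).mean()

# With min_periods=1, shorter windows at the start are averaged as well.
partial = ts.rolling(window=3, min_periods=1).mean()

print(rolling)
print(partial)
```

With the window of 12 used above on monthly data, each plotted point is the average of the preceding year of observations.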

Analyzing Sequential Data by Hidden Markov Model (HMM)

HMM is a statistical model which is widely used for data having continuity and extensibility, such as time series stock market analysis, health checkup, and speech recognition. This section deals in detail with analyzing sequential data using the Hidden Markov Model (HMM).

Hidden Markov Model (HMM)

HMM is a stochastic model which is built upon the concept of a Markov chain, based on the assumption that the probability of future states depends only on the current process state rather than on any state that preceded it. For example, when tossing a coin, we cannot say that the result of the fifth toss will be a head. This is because a coin does not have any memory and the next result does not depend on the previous result.

Mathematically, HMM consists of the following variables −

States (S)

It is a set of hidden or latent states present in an HMM. It is denoted by S.

Output symbols (O)

It is a set of possible output symbols present in an HMM. It is denoted by O.

State Transition Probability Matrix (A)

It is the probability of making a transition from one state to each of the other states. It is denoted by A.

Observation Emission Probability Matrix (B)

It is the probability of emitting/observing a symbol at a particular state. It is denoted by B.

Prior Probability Matrix (π)

It is the probability of starting at a particular state from various states of the system. It is denoted by π.

Hence, an HMM may be defined as λ = (S, O, A, B, π),

where,

    S = {s1,s2,…,sN} is a set of N possible states,

    O = {o1,o2,…,oM} is a set of M possible observation symbols,

    A is an N×N state Transition Probability Matrix (TPM),

    B is an N×M observation or Emission Probability Matrix (EPM),

    π is an N dimensional initial state probability distribution vector.
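
To make these definitions concrete, the following sketch writes A, B and π as NumPy arrays for a hypothetical model with N = 2 states and M = 2 symbols, and uses the standard forward algorithm to compute the probability of an observation sequence. All numbers and the function name sequence_probability are made up for illustration −

```python
import itertools

import numpy as np

# Hypothetical two-state, two-symbol HMM; every number here is illustrative.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # N x N: A[i, j] = P(next state j | current state i)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # N x M: B[i, k] = P(symbol k | state i)
pi = np.array([0.6, 0.4])    # N: pi[i] = P(first state is i)

def sequence_probability(observations, A, B, pi):
    """Forward algorithm: probability of the observed symbols under the model."""
    # alpha[i] = P(symbols seen so far, current state = i)
    alpha = pi * B[:, observations[0]]
    for symbol in observations[1:]:
        alpha = (alpha @ A) * B[:, symbol]
    return float(alpha.sum())

p = sequence_probability([0, 1, 0], A, B, pi)

# Sanity check: the probabilities of all length-3 sequences add up to 1.
total = sum(sequence_probability(list(seq), A, B, pi)
            for seq in itertools.product(range(2), repeat=3))
print(p, total)
```

Each row of A and B is a probability distribution, so every row must sum to 1, and the same holds for π.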

Example: Analysis of Stock Market data

In this example, we are going to analyze the data of stock market, step by step, to get an idea about how the HMM works with sequential or time series data. Please note that we are implementing this example in Python.

Import the necessary packages as shown below −

import datetime
import warnings

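Now, use the stock market data from the matplotlib.finance package, as shown here. Note that matplotlib.finance was removed in Matplotlib 2.2, so these imports need an older Matplotlib release −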

import numpy as np
from matplotlib import cm, pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator
try:
   from matplotlib.finance import quotes_historical_yahoo_ochl
except ImportError:
   # Older Matplotlib releases expose the same function under a different name
   from matplotlib.finance import (
      quotes_historical_yahoo as quotes_historical_yahoo_ochl)

from hmmlearn.hmm import GaussianHMM

Load the data from a start date and end date, i.e., between two specific dates as shown here −

start_date = datetime.date(1995, 10, 10)
end_date = datetime.date(2015, 4, 25)
quotes = quotes_historical_yahoo_ochl("INTC", start_date, end_date)

In this step, we will extract the closing quotes every day. For this, use the following command −

closing_quotes = np.array([quote[2] for quote in quotes])

Now, we will extract the volume of shares traded every day. For this, use the following command −

volumes = np.array([quote[5] for quote in quotes])[1:]

Here, take the percentage difference of closing stock prices, using the code shown below −

diff_percentages = 100.0 * np.diff(closing_quotes) / closing_quotes[:-1]
dates = np.array([quote[0] for quote in quotes], dtype = int)[1:]
training_data = np.column_stack([diff_percentages, volumes])
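
The effect of these lines is easier to follow on a tiny example. The sketch below uses made-up closing prices and volumes (not real INTC quotes) and shows why the first volume entry is dropped: np.diff returns one value fewer than its input, so the two columns have to be aligned −

```python
import numpy as np

closing = np.array([100.0, 110.0, 99.0, 99.0])   # made-up closing prices
volume = np.array([500.0, 650.0, 700.0, 400.0])  # made-up traded volumes

# Day-over-day change, expressed as a percentage of the previous close.
pct_change = 100.0 * np.diff(closing) / closing[:-1]

# One row per day (from the second day on): [percentage change, volume].
features = np.column_stack([pct_change, volume[1:]])
print(features)
```

Each row of the resulting array is one observation with two features, which is the layout that the training step expects.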

In this step, create and train the Gaussian HMM. For this, use the following code −

hmm = GaussianHMM(n_components = 7, covariance_type = "diag", n_iter = 1000)
with warnings.catch_warnings():
   warnings.simplefilter("ignore")
   hmm.fit(training_data)

Now, generate data using the HMM model, using the commands shown −

num_samples = 300
samples, _ = hmm.sample(num_samples)
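
Conceptually, sampling from a trained Gaussian HMM means walking through the hidden states according to the transition matrix and drawing each observation from the Gaussian that belongs to the current state. The NumPy sketch below illustrates that idea with a hypothetical two-state model and diagonal covariances; the function name sample_gaussian_hmm and all parameter values are made up, and this is not hmmlearn's own implementation −

```python
import numpy as np

def sample_gaussian_hmm(n_samples, pi, A, means, stds, seed=0):
    """Draw observations and hidden states from a Gaussian HMM with diagonal covariance."""
    rng = np.random.default_rng(seed)
    n_states = len(pi)
    states = np.empty(n_samples, dtype=int)
    # The first state comes from the initial distribution pi.
    states[0] = rng.choice(n_states, p=pi)
    # Every later state depends only on the state before it.
    for t in range(1, n_samples):
        states[t] = rng.choice(n_states, p=A[states[t - 1]])
    # Each observation is drawn from the Gaussian of its hidden state.
    observations = rng.normal(means[states], stds[states])
    return observations, states

# Hypothetical parameters: two states, two features per observation.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
means = np.array([[0.0, 50.0],
                  [1.0, 80.0]])
stds = np.array([[1.0, 5.0],
                 [2.0, 10.0]])

demo_samples, demo_states = sample_gaussian_hmm(300, pi, A, means, stds)
print(demo_samples.shape)
```

The sketch returns an array of observations together with the hidden state sequence that produced them, mirroring the two values returned by hmm.sample above.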

Finally, in this step, we plot and visualize the difference percentages and the volume of shares traded as output in the form of graphs.

Use the following code to plot and visualize the difference percentages −

plt.figure()
plt.title("Difference percentages")
plt.plot(np.arange(num_samples), samples[:, 0], c = "black")

Use the following code to plot and visualize the volume of shares traded −

plt.figure()
plt.title("Volume of shares")
plt.plot(np.arange(num_samples), samples[:, 1], c = "black")
plt.ylim(bottom = 0)
plt.show()