Julia - Data Frames
  • Date: 2024-11-03

Julia Programming - Data Frames



A DataFrame is a table, much like a spreadsheet, that we can use to sort and explore a set of related data values. In other words, it is a smarter array for holding tabular data. Before using it, we need to download and install the DataFrames and CSV packages as follows −


(@v1.5) pkg> add DataFrames
(@v1.5) pkg> add CSV

To start using the DataFrames package, type the following command −


julia> using DataFrames

Loading data into DataFrames

There are several ways to create new DataFrames (which we will discuss later in this section), but one of the quickest ways to get started is to build a DataFrame from a matrix holding the Anscombe dataset. For better understanding, let us see the example below −


anscombe = DataFrame(
         [10 10 10 8 8.04 9.14 7.46 6.58;
           8 8   8 8 6.95 8.14 6.77 5.76;
          13 13 13 8 7.58 8.74 12.74 7.71;
           9 9   9 8 8.81 8.77 7.11 8.84;
          11 11 11 8 8.33 9.26 7.81 8.47;
          14 14 14 8 9.96 8.1  8.84 7.04;
           6 6   6 8 7.24 6.13 6.08 5.25;
           4 4   4 19 4.26 3.1 5.39 12.5;
          12 12 12 8 10.84 9.13 8.15 5.56;
           7 7   7 8 4.82  7.26 6.42 7.91;
           5 5   5 8 5.68 4.74 5.73 6.89]);


julia> rename!(anscombe, [Symbol.(:N, 1:4); Symbol.(:M, 1:4)])
11×8 DataFrame
│ Row │    N1   │    N2   │   N3    │    N4   │    M1   │     M2  │     M3  │    M4   │
│     │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 10.0    │ 10.0    │ 10.0    │ 8.0     │ 8.04    │ 9.14    │ 7.46    │ 6.58    │
│ 2   │ 8.0     │ 8.0     │ 8.0     │ 8.0     │ 6.95    │ 8.14    │ 6.77    │ 5.76    │
│ 3   │ 13.0    │ 13.0    │ 13.0    │ 8.0     │ 7.58    │ 8.74    │ 12.74   │ 7.71    │
│ 4   │ 9.0     │ 9.0     │ 9.0     │ 8.0     │ 8.81    │ 8.77    │ 7.11    │ 8.84    │
│ 5   │ 11.0    │ 11.0    │ 11.0    │ 8.0     │ 8.33    │ 9.26    │ 7.81    │ 8.47    │
│ 6   │ 14.0    │ 14.0    │ 14.0    │ 8.0     │ 9.96    │ 8.1     │ 8.84    │ 7.04    │
│ 7   │ 6.0     │ 6.0     │ 6.0     │ 8.0     │ 7.24    │ 6.13    │ 6.08    │ 5.25    │
│ 8   │ 4.0     │ 4.0     │ 4.0     │ 19.0    │ 4.26    │ 3.1     │ 5.39    │ 12.5    │
│ 9   │ 12.0    │ 12.0    │ 12.0    │ 8.0     │ 10.84   │ 9.13    │ 8.15    │ 5.56    │
│10   │ 7.0     │ 7.0     │ 7.0     │ 8.0     │ 4.82    │ 7.26    │ 6.42    │ 7.91    │
│11   │ 5.0     │ 5.0     │ 5.0     │ 8.0     │ 5.68    │ 4.74    │ 5.73    │ 6.89    │

We built the DataFrame from a matrix, assigned it to a variable named anscombe, and then renamed its columns.
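The transcript above comes from an older DataFrames release. As a minimal sketch, assuming DataFrames 1.0 or later, the matrix constructor takes the column names (or `:auto`) as a second argument, so the names can be supplied when the table is built. The variable names `m`, `colnames`, `df_named` and `df_auto` are illustrative only, and only the first three rows of the matrix are used to keep the example short.

```julia
using DataFrames

# First three rows of the Anscombe matrix, kept short for illustration.
m = [10 10 10 8 8.04 9.14 7.46 6.58;
      8  8  8 8 6.95 8.14 6.77 5.76;
     13 13 13 8 7.58 8.74 12.74 7.71]

# Same names that the rename! call produces: N1..N4 followed by M1..M4.
colnames = [Symbol.(:N, 1:4); Symbol.(:M, 1:4)]

# Assumption: DataFrames 1.0+, where a matrix needs explicit names
# or :auto (which generates x1, x2, ...) as the second argument.
df_named = DataFrame(m, colnames)
df_auto  = DataFrame(m, :auto)

println(names(df_named))
println(names(df_auto))
```

Passing the names directly avoids the separate rename! step; `:auto` is handy when the names do not matter yet.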

Collected Datasets

We can also use a dataset collection package named RDatasets. It contains several well-known datasets, including Anscombe’s. Before we start using it, we need to download and install it as follows −


(@v1.5) pkg> add RDatasets

To start using this package, type the following command −


julia> using RDatasets
julia> anscombe = dataset("datasets","anscombe")
11×8 DataFrame
│ Row │   X1  │   X2  │   X3  │   X4  │   Y1    │    Y2   │    Y3   │   Y4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────┼───────┼───────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │  10   │   10  │    10 │  8    │   8.04  │  9.14   │    7.46 │ 6.58    │
│ 2   │  8    │   8   │    8  │  8    │   6.95  │  8.14   │    6.77 │ 5.76    │
│ 3   │  13   │   13  │    13 │  8    │   7.58  │  8.74   │    12.74│ 7.71    │
│ 4   │  9    │   9   │    9  │  8    │   8.81  │  8.77   │    7.11 │ 8.84    │
│ 5   │  11   │   11  │    11 │  8    │   8.33  │  9.26   │    7.81 │ 8.47    │
│ 6   │  14   │   14  │    14 │  8    │   9.96  │  8.1    │    8.84 │ 7.04    │
│ 7   │  6    │   6   │    6  │  8    │   7.24  │  6.13   │    6.08 │ 5.25    │
│ 8   │  4    │   4   │    4  │  19   │   4.26  │  3.1    │    5.39 │ 12.5    │
│ 9   │  12   │   12  │    12 │  8    │   10.84 │  9.13   │    8.15 │ 5.56    │
│ 10  │  7    │   7   │    7  │  8    │   4.82  │  7.26   │    6.42 │ 7.91    │
│ 11  │  5    │   5   │    5  │  8    │   5.68  │  4.74   │    5.73 │ 6.89    │

Empty DataFrames

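We can also create a DataFrame by supplying the column names and their values directly, much as we would when building an array. Note that the DataFrame in this example, despite its name, is not empty: it has ten rows.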

Example


julia> empty_df = DataFrame(X = 1:10, Y = 21:30)
10×2 DataFrame
│ Row │   X   │   Y   │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │    1  │   21  │
│ 2   │    2  │   22  │
│ 3   │    3  │   23  │
│ 4   │    4  │   24  │
│ 5   │    5  │   25  │
│ 6   │    6  │   26  │
│ 7   │    7  │   27  │
│ 8   │    8  │   28  │
│ 9   │    9  │   29  │
│ 10  │   10  │   30  │

To create a completely empty DataFrame, we only need to supply the column names and define their element types as follows −


julia> Complete_empty_df = DataFrame(Name=String[],
         W=Float64[],
         H=Float64[],
         M=Float64[],
         V=Float64[])
0×5 DataFrame


julia> Complete_empty_df = vcat(Complete_empty_df, DataFrame(Name="EmptyTestDataFrame", W=5.0, H=5.0, M=3.0, V=5.0))
1×5 DataFrame
│ Row │     Name           │     W   │    H    │   M     │     V   │
│     │     String         │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ EmptyTestDataFrame │  5.0    │    5.0  │   3.0   │     5.0 │


julia> Complete_empty_df = vcat(Complete_empty_df, DataFrame(Name="EmptyTestDataFrame2", W=6.0, H=6.0, M=5.0, V=7.0))
2×5 DataFrame
│ Row │        Name         │    W    │   H     │     M   │    V    │
│     │        String       │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ EmptyTestDataFrame  │    5.0  │   5.0   │  3.0    │    5.0  │
│ 2   │ EmptyTestDataFrame2 │    6.0  │   6.0   │  5.0    │    7.0  │
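The vcat() calls above build a new DataFrame each time and rebind the variable. As a hedged alternative sketch, push!() appends a row to an existing empty DataFrame in place. The shortened three-column schema and the name `df` are assumptions made for brevity.

```julia
using DataFrames

# Typed but empty columns, a shortened version of the schema used above.
df = DataFrame(Name = String[], W = Float64[], H = Float64[])

# push! appends one row in place; a NamedTuple matches values to columns by name.
push!(df, (Name = "EmptyTestDataFrame", W = 5.0, H = 5.0))
push!(df, (Name = "EmptyTestDataFrame2", W = 6.0, H = 6.0))

println(nrow(df), " rows, ", ncol(df), " columns")
```

Because the column types were declared up front, pushing a value of the wrong type (for example a string into W) raises an error instead of silently changing the column type.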

Plotting Anscombe’s Quartet

Now that the Anscombe dataset has been loaded, we can compute some statistics on it. The built-in function named describe() calculates statistical properties of the columns of a dataset. You can supply any of the symbols given below to select the properties −

    mean

    std

    min

    q25

    median

    q75

    max

    eltype

    nunique

    first

    last

    nmissing

Example


julia> describe(anscombe, :mean, :std, :min, :median, :q25)
8×6 DataFrame
│ Row │ variable │   mean  │     std │ min  │  median │   q25   │
│     │  Symbol  │ Float64 │ Float64 │ Real │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│  1  │     X1   │   9.0   │ 3.31662 │    4 │   9.0   │  6.5    │
│  2  │     X2   │   9.0   │ 3.31662 │    4 │   9.0   │  6.5    │
│  3  │     X3   │   9.0   │ 3.31662 │    4 │   9.0   │  6.5    │
│  4  │     X4   │   9.0   │ 3.31662 │    8 │   8.0   │  8.0    │
│  5  │     Y1   │ 7.50091 │ 2.03157 │ 4.26 │   7.58  │ 6.315   │
│  6  │     Y2   │ 7.50091 │ 2.03166 │ 3.1  │   8.14  │ 6.695   │
│  7  │     Y3   │   7.5   │ 2.03042 │ 5.39 │   7.11  │ 6.25    │
│  8  │     Y4   │ 7.50091 │ 2.03058 │ 5.25 │   7.04  │ 6.17    │

We can also compare the four X/Y pairs of the dataset as follows −


julia> [describe(anscombe[:, xy], :mean, :std, :median, :q25) for xy in [[:X1, :Y1], [:X2, :Y2], [:X3, :Y3], [:X4, :Y4]]]
4-element Array{DataFrame,1}:
2×5 DataFrame
│ Row │ variable │   mean  │     std │  median │   q25   │
│     │  Symbol  │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │   X1     │   9.0   │ 3.31662 │    9.0  │   6.5   │
│ 2   │  Y1      │ 7.50091 │ 2.03157 │    7.58 │ 6.315   │
2×5 DataFrame
│ Row │ variable │   mean  │  std    │ median  │   q25   │
│     │  Symbol  │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │     X2   │    9.0  │ 3.31662 │    9.0  │     6.5 │
│ 2   │     Y2   │ 7.50091 │ 2.03166 │    8.14 │   6.695 │
2×5 DataFrame
│ Row │ variable │   mean  │     std │  median │   q25   │
│     │  Symbol  │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │     X3   │     9.0 │ 3.31662 │    9.0  │   6.5   │
│ 2   │     Y3   │     7.5 │ 2.03042 │    7.11 │   6.25  │
2×5 DataFrame
│ Row │ variable │   mean  │  std    │  median │   q25   │
│     │  Symbol  │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │     X4   │   9.0   │ 3.31662 │    8.0  │   8.0   │
│ 2   │     Y4   │ 7.50091 │ 2.03058 │    7.04 │   6.17  │
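The figures that describe() reports for the first pair can be cross-checked with the Statistics standard library alone. The sketch below copies the X1 and Y1 columns from the table shown earlier into plain vectors; the names `x1` and `y1` are illustrative.

```julia
using Statistics

# X1 and Y1 columns of the Anscombe dataset, copied from the table above.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

println("mean(x1)   = ", mean(x1))
println("std(x1)    = ", round(std(x1), digits = 5))
println("mean(y1)   = ", round(mean(y1), digits = 5))
println("median(y1) = ", median(y1))
println("q25(y1)    = ", quantile(y1, 0.25))
```

The :q25 property corresponds to quantile(column, 0.25), and std() uses the sample (n - 1) definition, which is why the values agree with the describe() output.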

Let us now turn to the true purpose of Anscombe’s quartet, i.e., plot its four sets as follows −


julia> using StatsPlots
[ Info: Precompiling StatsPlots [f3b207a7-027a-5e70-b257-86293d7955fd]

julia> @df anscombe scatter([:X1 :X2 :X3 :X4], [:Y1 :Y2 :Y3 :Y4],
            smooth=true,
            line = :red,
            linewidth = 2,
            title = permutedims(["X$i vs Y$i" for i in 1:4]),  # 1×4 row: one title per subplot
            legend = false,
            layout = 4,
            xlims = (2, 20),
            ylims = (2, 14))
[Figure: Anscombe’s quartet]

Regression and Models

In this section, we will fit a linear regression line to the dataset. For this we need the Generalized Linear Model (GLM) package, which you need to add first as follows −


(@v1.5) pkg> add GLM

Now let us create a linear regression model by specifying a formula with the @formula macro and supplying the column names as well as the name of the DataFrame. An example is given below −


julia> using GLM

julia> linearregressionmodel = fit(LinearModel, @formula(Y1 ~ X1), anscombe)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}


Y1 ~ 1 + X1

Coefficients:
──────────────────────────────────────────────────────────────────────────
              Coef.     Std. Error     t   Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   3.00009   1.12475     2.67     0.0257   0.455737   5.54444
X1            0.500091  0.117906    4.24     0.0022   0.23337    0.766812
──────────────────────────────────────────────────────────────────────────

Let us check the summary and the coefficients of the linear regression model created above −


julia> summary(linearregressionmodel)
"StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}"

julia> coef(linearregressionmodel)
2-element Array{Float64,1}:
 3.0000909090909054
 0.5000909090909096

Now let us define a function for the regression line. It has the form y = ax + c, where a is the slope (the second coefficient) and c is the intercept (the first coefficient).


julia> f(x) = coef(linearregressionmodel)[2] * x + coef(linearregressionmodel)[1]
f (generic function with 1 method)
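The two coefficients can be cross-checked without GLM. The sketch below is not how GLM computes them internally; it simply solves the same least-squares problem with Julia's backslash operator on a hand-built design matrix. The names `x1`, `y1`, `X` and `beta` are illustrative.

```julia
# X1 and Y1 columns of the Anscombe dataset, copied from the table above.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Design matrix: a column of ones for the intercept, then the predictor.
X = [ones(length(x1)) x1]

# For a tall (non-square) matrix, `\` returns the least-squares solution.
beta = X \ y1

println("intercept = ", beta[1])
println("slope     = ", beta[2])
```

The intercept and slope agree with the (Intercept) and X1 rows of the coefficient table to the printed precision.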

Once we have the function that describes the regression line, we can draw a plot as follows −


julia> p1 = plot(anscombe.X1, anscombe.Y1,
            smooth=true,
            seriestype=:scatter,
            title = "X1 vs Y1",
            linewidth=8,
            linealpha=0.5,
            label="data")

julia> plot!(f, 2, 20, label="correlation")
[Figure: X1 vs Y1 scatter plot with the regression line]

Working with DataFrames

Nothing is perfect, and datasets are no exception: not all of them are consistent and tidy. To show how we can work with the different items of a DataFrame, let us create a test DataFrame −


julia> testdf = DataFrame( Number = [3, 5, 7, 8, 20 ],
                             Name = ["Lithium", "Boron", "Nitrogen", "Oxygen", "Calcium" ],
                     AtomicWeight = [6.941, 10.811, 14.0067, 15.9994, 40.078 ],
                           Symbol = ["Li", "B", "N", "O", "Ca" ],
                       Discovered = [1817, 1808, 1772, 1774, missing ])
5×5 DataFrame
│ Row │ Number │     Name │ AtomicWeight │ Symbol │ Discovered │
│     │  Int64 │  String  │      Float64 │ String │   Int64?   │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┤
│  1  │     3  │ Lithium  │      6.941   │     Li │   1817     │
│  2  │     5  │  Boron   │      10.811  │     B  │   1808     │
│  3  │     7  │ Nitrogen │      14.0067 │     N  │   1772     │
│  4  │     8  │  Oxygen  │      15.9994 │     O  │   1774     │
│  5  │     20 │ Calcium  │      40.078  │     Ca │   missing  │                     

Handling missing values

There can be some missing values in datasets. We can check for them with the help of the describe() function, whose nmissing column counts the missing values in each column −


julia> describe(testdf)
5×8 DataFrame
│ Row │     variable │  mean   │ min   │ median  │    max │ nunique │ nmissing │        eltype         │
│     │     Symbol   │  Union… │ Any   │ Union…  │    Any │ Union…  │ Union…   │        Type           │
├─────┼──────────────┼─────────┼───────┼─────────┼────────┼─────────┼──────────┼───────────────────────┤
│  1  │     Number   │  8.6    │    3  │  7.0    │    20  │         │          │       Int64           │
│  2  │     Name     │         │ Boron │         │ Oxygen │    5    │          │       String          │
│  3  │ AtomicWeight │ 17.5672 │ 6.941 │ 14.0067 │ 40.078 │         │          │       Float64         │
│  4  │ Symbol       │         │    B  │         │      O │    5    │          │       String          │
│  5  │ Discovered   │ 1792.75 │ 1772  │ 1791.0  │   1817 │         │     1    │ Union{Missing, Int64} │

Julia provides a special datatype called Missing to address this issue. A value of this type indicates that there is no usable value at that location. The DataFrames package supports it, so we can still use most of a dataset while making sure that calculations are not distorted by missing values.

Looking for missing values

We can use the ismissing() function to check whether the DataFrame has any missing values.

Example


julia> for row in 1:nrow(testdf)
            for col in 1:ncol(testdf)
               if ismissing(testdf[row, col])
                  println("$(names(testdf)[col]) value for $(testdf[row, :Name]) is missing!")
               end
            end
         end

Discovered value for Calcium is missing!
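When only the number of missing values per column is needed, the cell-by-cell loop can be replaced by one count() call per column. This is a minimal sketch on a hypothetical two-row frame named `df`, not the testdf used above.

```julia
using DataFrames

# Hypothetical two-row frame with one missing entry.
df = DataFrame(Name = ["Lithium", "Calcium"],
               Discovered = [1817, missing])

# Count missing entries column by column instead of visiting every cell.
for name in names(df)
    n = count(ismissing, df[!, name])
    println(name, ": ", n, " missing")
end
```

Here df[!, name] accesses the stored column without copying it, which is sufficient for a read-only count.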

Repairing DataFrames

We can use the following code to replace values that are not acceptable, like “n/a” or “0”, with missing. The code below looks in every cell for these non-acceptable values.

Example


julia> for row in 1:size(testdf, 1) # or nrow(testdf)
         for col in 1:size(testdf, 2) # or ncol(testdf)
            println("processing row $row column $col ")
            temp = testdf[row, col]
            if ismissing(temp)
               println("skipping missing")
            elseif temp == "n/a" || temp == "0" || temp == 0
               testdf[row, col] = missing
               println("changed row $row column $col ")
            end
         end
      end
processing row 1 column 1
processing row 1 column 2
processing row 1 column 3
processing row 1 column 4
processing row 1 column 5
processing row 2 column 1
processing row 2 column 2
processing row 2 column 3
processing row 2 column 4
processing row 2 column 5
processing row 3 column 1
processing row 3 column 2
processing row 3 column 3
processing row 3 column 4
processing row 3 column 5
processing row 4 column 1
processing row 4 column 2
processing row 4 column 3
processing row 4 column 4
processing row 4 column 5
processing row 5 column 1
processing row 5 column 2
processing row 5 column 3
processing row 5 column 4
processing row 5 column 5
skipping missing
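The same repair can be expressed per column with Base's replace(), which returns a new vector whose element type is widened to allow missing. The sketch below works on a hypothetical vector `raw` of strings rather than on testdf, which contains no such placeholder values.

```julia
# Hypothetical column with two placeholder strings standing in for absent data.
raw = ["6.941", "n/a", "14.0067", "0"]

# Each pair maps an unacceptable value to missing; other entries are kept.
cleaned = replace(raw, "n/a" => missing, "0" => missing)

println(cleaned)
println(eltype(cleaned))
```

Unlike the in-place loop, this does not fail when the original column type cannot hold missing, because the result is a fresh vector.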

Working with missing values

Julia provides support for representing missing values in the statistical sense, that is, for situations where no value is available for a variable in an observation, but a valid value theoretically exists.

completecases()

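The completecases() function returns a Boolean vector that marks the rows containing no missing values. Here we use it to keep only the complete rows, so that we can find the maximum value of a column that contains a missing value.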

Example


julia> maximum(testdf[completecases(testdf), :].Discovered)
1817
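For a single column, Base's skipmissing() gives the same result without filtering rows first. The sketch below uses a plain vector `discovered` holding the values of the Discovered column shown earlier.

```julia
using Statistics

# Values of the Discovered column, including the missing entry.
discovered = [1817, 1808, 1772, 1774, missing]

# A missing value propagates through reductions such as maximum.
println(maximum(discovered))

# skipmissing() wraps the vector in an iterator that omits missing entries.
println(maximum(skipmissing(discovered)))
println(mean(skipmissing(discovered)))
```

The mean of the non-missing values matches the figure that describe() reports for the Discovered column.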

dropmissing()

The dropmissing() function returns a copy of the DataFrame without the rows that contain missing values.

Example


julia> dropmissing(testdf)
4×5 DataFrame
│ Row │ Number │    Name  │ AtomicWeight │ Symbol │ Discovered │
│     │ Int64  │  String  │   Float64    │ String │    Int64   │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┤
│  1  │  3     │ Lithium  │     6.941    │   Li   │   1817     │
│  2  │  5     │  Boron   │   10.811     │    B   │   1808     │
│  3  │  7     │ Nitrogen │   14.0067    │    N   │   1772     │
│  4  │  8     │  Oxygen  │   15.9994    │    O   │   1774     │

Modifying DataFrames

The DataFrames package of Julia provides various methods with which you can add, remove, and rename columns, and add or delete rows.

Adding Columns

We can use the hcat() function to add a column of integers to the DataFrame. It can be used as follows −


julia> hcat(testdf, axes(testdf, 1))
5×6 DataFrame
│ Row │ Number │     Name │ AtomicWeight │ Symbol │ Discovered │  x1   │
│     │ Int64  │   String │   Float64    │ String │   Int64?   │ Int64 │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┼───────┤
│  1  │  3     │  Lithium │      6.941   │   Li   │   1817     │   1   │
│  2  │  5     │    Boron │      10.811  │   B    │    1808    │   2   │
│  3  │  7     │ Nitrogen │      14.0067 │   N    │    1772    │   3   │
│  4  │  8     │   Oxygen │      15.9994 │   O    │   1774     │   4   │
│  5  │  20    │  Calcium │      40.078  │   Ca   │   missing  │   5   │

Notice, however, that hcat() returned a new DataFrame: we have not changed testdf or assigned the result to a variable. We can add another column to testdf itself as follows −


julia> testdf[!, :MP] = [180.7, 2300, -209.86, -222.65, 839]
5-element Array{Float64,1}:
 180.7
 2300.0
 -209.86
 -222.65
 839.0
julia> testdf
5×6 DataFrame
│ Row │ Number │    Name  │ AtomicWeight │ Symbol │ Discovered │   MP    │
│     │  Int64 │  String  │   Float64    │ String │   Int64?   │ Float64 │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┼─────────┤
│  1  │     3  │  Lithium │      6.941   │   Li   │    1817    │ 180.7   │
│  2  │     5  │  Boron   │      10.811  │    B   │    1808    │ 2300.0  │
│  3  │     7  │ Nitrogen │      14.0067 │   N    │    1772    │ -209.86 │
│  4  │     8  │   Oxygen │      15.9994 │   O    │    1774    │ -222.65 │
│  5  │     20 │  Calcium │      40.078  │   Ca   │    missing │ 839.0   │

We have added a column having melting points of all the elements to our test DataFrame.
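Two other common ways to add a column in place are property assignment and insertcols!(), which also lets us choose the position. This is a minimal sketch assuming a recent DataFrames release; the frame `df` and the column name `Id` are hypothetical.

```julia
using DataFrames

# Hypothetical two-row frame standing in for testdf.
df = DataFrame(Name = ["Lithium", "Boron"])

# Property syntax appends a new column at the end, in place.
df.MP = [180.7, 2300.0]

# insertcols! places the new column at a chosen position (here: first).
insertcols!(df, 1, :Id => [1, 2])

println(names(df))
```

Both forms modify df directly, unlike hcat(), which leaves its arguments untouched.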

Removing Columns

We can use the select!() function to remove a column from the DataFrame. It modifies the DataFrame in place so that it contains only the selected columns; hence, to remove a particular column, we need to use select!() with Not. It is shown in the given example −


julia> select!(testdf, Not(:MP))
5×5 DataFrame
│ Row │ Number │  Name    │ AtomicWeight │ Symbol │ Discovered │
│     │ Int64  │ String   │   Float64    │ String │   Int64?   │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┤
│  1  │     3  │ Lithium  │      6.941   │   Li   │   1817     │
│  2  │     5  │  Boron   │      10.811  │   B    │   1808     │
│  3  │     7  │ Nitrogen │      14.0067 │   N    │   1772     │
│  4  │     8  │  Oxygen  │      15.9994 │   O    │   1774     │
│  5  │     20 │ Calcium  │      40.078  │   Ca   │   missing  │

We have removed the column MP from our DataFrame.
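The difference between select() and select!() is easy to miss: the former returns a new DataFrame, while the latter changes its argument. The sketch below shows both on a hypothetical small frame; `df` and `trimmed` are illustrative names.

```julia
using DataFrames

# Hypothetical frame with a melting-point column to drop.
df = DataFrame(Number = [3, 5], Name = ["Lithium", "Boron"], MP = [180.7, 2300.0])

# select returns a new DataFrame; df still has all three columns afterwards.
trimmed = select(df, Not(:MP))
println(names(trimmed))
println(names(df))

# select! removes the column from df itself.
select!(df, Not(:MP))
println(names(df))
```

The non-mutating form is the safer choice when the original table is still needed later.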

Renaming Columns

We can use the rename!() function to rename a column in the DataFrame. We will be renaming the AtomicWeight column to AW as follows −


julia> rename!(testdf, :AtomicWeight => :AW)
5×5 DataFrame
│ Row │ Number │ Name     │   AW    │ Symbol │ Discovered │
│     │ Int64  │ String   │ Float64 │ String │  Int64?    │
├─────┼────────┼──────────┼─────────┼────────┼────────────┤
│  1  │   3    │  Lithium │ 6.941   │  Li    │     1817   │
│  2  │   5    │    Boron │ 10.811  │  B     │     1808   │
│  3  │   7    │ Nitrogen │ 14.0067 │  N     │     1772   │
│  4  │   8    │  Oxygen  │ 15.9994 │  O     │     1774   │
│  5  │  20    │  Calcium │ 40.078  │  Ca    │  missing   │

Adding rows

We can use the push!() function with suitable data to add rows to the DataFrame. In the example given below, we add a row for the element Copper −

Example


julia> push!(testdf, [29, "Copper", 63.546, "Cu", missing])
6×5 DataFrame
│ Row │ Number │    Name  │      AW │ Symbol │ Discovered │
│     │  Int64 │  String  │ Float64 │ String │    Int64?  │
├─────┼────────┼──────────┼─────────┼────────┼────────────┤
│  1  │     3  │  Lithium │   6.941 │     Li │     1817   │
│  2  │     5  │    Boron │  10.811 │      B │     1808   │
│  3  │     7  │ Nitrogen │ 14.0067 │      N │      1772  │
│  4  │     8  │  Oxygen  │ 15.9994 │      O │      1774  │
│  5  │     20 │  Calcium │  40.078 │     Ca │   missing  │
│  6  │     29 │  Copper  │  63.546 │     Cu │  missing   │

Deleting rows

We can use the deleterows!() function to delete rows from the DataFrame (newer DataFrames releases replace it with deleteat!()). In the example given below, we delete three rows (4th, 5th, and 6th) from our test DataFrame −

Example


julia> deleterows!(testdf, 4:6)
3×5 DataFrame
│ Row │ Number │  Name    │   AW    │ Symbol │ Discovered │
│     │  Int64 │ String   │ Float64 │ String │  Int64?    │
├─────┼────────┼──────────┼─────────┼────────┼────────────┤
│  1  │  3     │  Lithium │   6.941 │  Li    │  1817      │
│  2  │  5     │    Boron │  10.811 │  B     │  1808      │
│  3  │  7     │ Nitrogen │ 14.0067 │  N     │  1772      │
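A non-mutating way to get the same three rows is to index with Not, which selects every row except the listed ones. This is a sketch on a hypothetical six-row frame `df`; the original frame is left unchanged.

```julia
using DataFrames

# Hypothetical six-row frame standing in for testdf.
df = DataFrame(Number = [3, 5, 7, 8, 20, 29],
               Name = ["Lithium", "Boron", "Nitrogen", "Oxygen", "Calcium", "Copper"])

# Not(4:6) selects every row except rows 4, 5 and 6; df itself is unchanged.
kept = df[Not(4:6), :]

println(kept.Name)
println(nrow(df))
```

Because indexing with a row selector and `:` copies the data, later changes to kept do not affect df.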

Finding values in DataFrame

To find values in a DataFrame, we use an elementwise operator that examines all the rows. The operator returns an array of Boolean values indicating whether each row meets the criterion.

Example


julia> testdf[:, :AW] .< 10
3-element BitArray{1}:
1
0
0

julia> testdf[testdf[:, :AW] .< 10, :]
1×5 DataFrame
│ Row │ Number │   Name  │      AW │ Symbol │ Discovered │
│     │  Int64 │  String │ Float64 │ String │   Int64?   │
├─────┼────────┼─────────┼─────────┼────────┼────────────┤
│ 1   │    3   │ Lithium │ 6.941   │   Li   │   1817     │
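The Boolean-mask selection above can also be written with filter(), which takes a predicate instead of a precomputed mask. The sketch assumes a recent DataFrames release and uses a hypothetical three-row frame `df`.

```julia
using DataFrames

# Hypothetical frame with the same AW column as above.
df = DataFrame(Name = ["Lithium", "Boron", "Nitrogen"],
               AW = [6.941, 10.811, 14.0067])

# Row-wise predicate: each `row` exposes its columns as properties.
light = filter(row -> row.AW < 10, df)

# Column-wise form: the predicate receives the values of :AW one at a time.
heavy = filter(:AW => >(10), df)

println(light.Name)
println(heavy.Name)
```

Both calls return new DataFrames; the column-wise form avoids building a row object for every row.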

Sorting

To sort the values in a DataFrame, we can use the sort!() function, which sorts the DataFrame in place. We need to give the columns on which we want to sort.

Example


julia> sort!(testdf, [order(:AW)])
3×5 DataFrame
│ Row │ Number │  Name    │   AW    │ Symbol │ Discovered │
│     │  Int64 │ String   │ Float64 │ String │  Int64?    │
├─────┼────────┼──────────┼─────────┼────────┼────────────┤
│ 1   │  3     │ Lithium  │   6.941 │  Li    │     1817   │
│ 2   │  5     │  Boron   │  10.811 │  B     │     1808   │
│ 3   │  7     │ Nitrogen │ 14.0067 │  N     │     1772   │

The DataFrame is sorted based on the values of column AW.
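To sort in descending order, or to keep the original order intact, the non-mutating sort() with the rev keyword can be used. This is a sketch on a hypothetical frame `df` whose rows start out unsorted.

```julia
using DataFrames

# Hypothetical frame whose rows are not yet in AW order.
df = DataFrame(Name = ["Boron", "Nitrogen", "Lithium"],
               AW = [10.811, 14.0067, 6.941])

# sort returns a sorted copy; rev = true gives descending order.
by_weight_desc = sort(df, :AW, rev = true)

println(by_weight_desc.Name)
println(df.Name)
```

Since sort() returns a copy, df keeps its original row order.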
