Maxkit: Numerical and Scientific Computation in Julia

julia 提供處理大量資料，以及進行統計運算的 library

Working with data

首先是讀取資料的方式，如果是 binary 資料，可直接用 read(), write()

julia> for val in [:Bool, :Char, :Int8]
         @eval println(@which read(stdin, $val))
       end
read(s::IO, ::Type{Bool}) in Base at io.jl:593
read(io::IO, ::Type{Char}) in Base at io.jl:613
read(s::IO, ::Type{Int8}) in Base at io.jl:588

使用

julia> read(stdin, Char)
j
'j': ASCII/Unicode U+006a (category Ll: Letter, lowercase)

julia> read(stdin, Bool)
true
true

julia> read(stdin, Int8)
12
49

readline

julia> methods(readline)
# 4 methods for generic function "readline":
[1] readline(s::IOStream; keep) in Base at iostream.jl:433
[2] readline() in Base at io.jl:370
[3] readline(filename::AbstractString; keep) in Base at io.jl:364
[4] readline(s::IO; keep) in Base at io.jl:370

使用

julia> number = parse(Int64, readline())
12
12

julia> println(number)
12

working with text files

open 參數除了 filename，後面還有存取該檔案的 mode

julia> methods(open)
# 8 methods for generic function "open":
[1] open(fname::AbstractString; read, write, create, truncate, append) in Base at iostream.jl:275
[2] open(fname::AbstractString, mode::AbstractString) in Base at iostream.jl:339
[3] open(f::Function, cmds::Base.AbstractCmd, args...) in Base at process.jl:615
[4] open(f::Function, args...; kwargs...) in Base at iostream.jl:367
[5] open(cmds::Base.AbstractCmd) in Base at process.jl:584
[6] open(cmds::Base.AbstractCmd, mode::AbstractString) in Base at process.jl:562
[7] open(cmds::Base.AbstractCmd, mode::AbstractString, other::Union{RawFD, FileRedirect, IO}) in Base at process.jl:562
[8] open(cmds::Base.AbstractCmd, other::Union{RawFD, FileRedirect, IO}; write, read) in Base at process.jl:584

Mode	Description
r	read
r+	read, write
w	write, create, truncate
w+	read, write, create, truncate
a	write, create, append
a+	read, write, create, append

# 有個文字檔 sample.txt
shell> cat sample.txt

file = open("sample.txt")

# 以行為單位，讀入文字檔
file_data = readlines(file)

enumerate(file_data)

# 加上行號
for lines in enumerate(file_data)
   println(lines[1],"-> ", lines[2])
end

# 大寫列印
for line in file_data
   println(uppercase(line))
end

# 反過來列印
for line in file_data
   println(reverse(line))
end

# 計算行數
countlines("sample.txt")

# 列印第一行
first(file_data)

# 最後一行
last(file_data)

working with CSV

using DelimitedFiles

csvfile = readdlm("sample.csv", ',', String, '\n')

# 只取得第二欄資料
csvfile[:,2]

# 只取前三行資料
csvfile[1:3,:]

# 以 | 區隔資料的 sample.psv
# "Lora"|"UK"
# "James"|"UK"
readdlm("sample.psv",'|')

working with DataFrames

DataFrame 是一種類似 database 的資料結構，可以用類似 SQL 的方式存取 DataFrames

using DataFrames

# 取得所有 DataFrames 支援的 functions
names(DataFrames)

DataFrame: 2D data structure，類似 R 與 Pandas 的 DataFrame

DataArray: 比 julia 預設 Array type 提供更多 comparison 的功能，但目前已經 deprecated (ref: https://github.com/JuliaStats/DataArrays.jl)
DataFrame: 2D data structure，類似 R 與 Pandas 的 DataFrame

julia 的 Array 以 nothing 代表 missing value

julia> a = [1,2,3,nothing,5,6]
6-element Array{Union{Nothing, Int64},1}:
 1
 2
 3
  nothing
 5
 6

julia> typeof(a[4])
Nothing

DataFrame: https://juliadata.github.io/DataFrames.jl/stable/man/getting_started.html

julia> dframe = DataFrame(Names = ["John","May"], Age = [27,28])
2×2 DataFrame
│ Row │ Names  │ Age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ John   │ 27    │
│ 2   │ May    │ 28    │

julia> dframe.Names
2-element Array{String,1}:
 "John"
 "May"

julia> dframe.Age
2-element Array{Int64,1}:
 27
 28

julia> names(dframe)
2-element Array{Symbol,1}:
 :Names
 :Age

julia> describe(dframe)
2×8 DataFrame
│ Row │ variable │ mean   │ min  │ median │ max │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union… │ Any  │ Union… │ Any │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼────────┼──────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1   │ Names    │        │ John │        │ May │ 2       │          │ String   │
│ 2   │ Age      │ 27.5   │ 27   │ 27.5   │ 28  │         │          │ Int64    │

另外有個獨立的 CSV 套件，可處理 csv file，讀入 CSV file 後，就得到 DataFrame 物件

using Pkg
Pkg.add("CSV")
using CSV

julia> CSV.read("sample.csv")
3×2 DataFrame
│ Row │ Lora    │ UK      │
│     │ String⍰ │ String⍰ │
├─────┼─────────┼─────────┤
│ 1   │ James   │ UK      │
│ 2   │ Raj     │ India   │
│ 3   │ May     │ America │

Linear algebra

產生 matrix

# 亂數 3x3 matrix
julia> A = rand(3,3)
3×3 Array{Float64,2}:
 0.108282  0.433033  0.247145
 0.571936  0.369386  0.547845
 0.423382  0.380503  0.77661

# array 且初始為 1
julia> ones(5)
5-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0

Vector 與 Matrix 的差異

Vector{T} 是 Array{T, 1} 的 alias，Matrix{T} 是Array{T, 2} 的 alias

julia> Vector{Float64} == Array{Float64,1}
true

julia> Matrix{Float64} == Array{Float64,2}
true

矩陣乘法

julia> A = rand(3,3)
3×3 Array{Float64,2}:
 0.167005  0.188954  0.0483164
 0.82603   0.440255  0.00107621
 0.137211  0.308508  0.724953

julia> b = 2*A
3×3 Array{Float64,2}:
 0.334011  0.377909  0.0966328
 1.65206   0.88051   0.00215241
 0.274422  0.617015  1.44991

julia> b= 2A
3×3 Array{Float64,2}:
 0.334011  0.377909  0.0966328
 1.65206   0.88051   0.00215241
 0.274422  0.617015  1.44991

transpose 轉置矩陣

julia> transpose_of_A = A'
3×3 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 0.167005   0.82603     0.137211
 0.188954   0.440255    0.308508
 0.0483164  0.00107621  0.724953

inverse 逆矩陣

julia> inv(A)
3×3 Array{Float64,2}:
 -6.31559   2.41816    0.417329
 11.8591   -2.26692   -0.787013
 -3.85134   0.507016   1.63533

Determinant 行列式

julia> using LinearAlgebra

julia> det(A)
-0.05048337143385562

Eigenvalues 特徵向量

julia> eigvals(A)
3-element Array{Float64,1}:
 -0.1016764648907107
  0.5846559789763167
  0.8492342814186749

方程式運算

julia> 5
5

julia> equation = 3x^2 + 4x + 3
98

Differential calculus

Pkg.add("Calculus")
using Calculus
names(Calculus)

julia> f(x) = sin(x)
f (generic function with 1 method)
        
julia> f'(1.0)
0.5403023058631036

julia> f'(1.0) - cos(1.0)
-5.036193684304635e-12

julia> f''(1.0) - (-sin(1.0))
-6.647716624952338e-7

julia> differentiate("cos(x) + sin(x) + exp(-x) * cos(x)", :x)
:(1 * -(sin(x)) + 1 * cos(x) + ((-1 * exp(-x)) * cos(x) + exp(-x) * (1 * -(sin(x)))))

julia> differentiate("sin(x)", :x)
:(1 * cos(x))

julia> differentiate("cos(x) + sin(y) + exp(-x) * cos(y)", [:x, :y])
2-element Array{Any,1}:
 :(1 * -(sin(x)) + ((-1 * exp(-x)) * cos(y) + exp(-x) * 0))
 :(1 * cos(y) + (0 * cos(y) + exp(-x) * (1 * -(sin(y)))))

Statistics

JuliaStats 所列出支援的 packages

DataFrames
Distributions: probability distribution
MultivariateStats: Multivariate statistical analysis
HypothesisTests
MLBase: Swiss knife for machine learning
Distances: Various distances between vectors
KernelDensity
Clustering: algorithm for data clustering
GLM: Generalized linear models
NMF: Nonnegative matrix factorization
RegERMs: Regularized empirical risk minimization
MCMC: Markov Chain Monte Carlo
TimeSeries: time series analysis

simple statistics

julia> using Statistics

julia> x = [10,20,30,40,50]

julia> mean(x)
30.0

# 中位數 median
julia> median(x)
30.0

julia> sum(x)
150

# 標準差
julia> std(x)
15.811388300841896

# variance 變異數
julia> var(x)
250.0

julia 也支援 cumulative operations 的 functions

accumulate
cumsum(): cumulative summation
cumprod(): cumulative product

julia> accumulate(+, x)
5-element Array{Int64,1}:
  10
  30
  60
 100
 150

julia> accumulate(-, x)
5-element Array{Int64,1}:
   10
  -10
  -40
  -80
 -130

julia> cumsum(x)
5-element Array{Int64,1}:
  10
  30
  60
 100
 150

julia> cumprod(x)
5-element Array{Int64,1}:
       10
      200
     6000
   240000
 12000000

julia> for i in [:cumsum,:cumprod]
               @eval print($i, "->")
               @eval println($i(x))
       end
cumsum->[10, 30, 60, 100, 150]
cumprod->[10, 200, 6000, 240000, 12000000]

statistics using DataFrames

julia> using DataFrames

julia> dframe = DataFrame(Subjects = ["Maths","Physics","Chemistry"],Marks = [90,85,95])
3×2 DataFrame
│ Row │ Subjects  │ Marks │
│     │ String    │ Int64 │
├─────┼───────────┼───────┤
│ 1   │ Maths     │ 90    │
│ 2   │ Physics   │ 85    │
│ 3   │ Chemistry │ 95    │

julia> describe(dframe)
2×8 DataFrame
│ Row │ variable │ mean   │ min       │ median │ max     │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union… │ Any       │ Union… │ Any     │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼────────┼───────────┼────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ Subjects │        │ Chemistry │        │ Physics │ 3       │          │ String   │
│ 2   │ Marks    │ 90.0   │ 85        │ 90.0   │ 95      │         │          │ Int64    │

Pandas

對習慣使用 python 進行資料分析的使用者來說，Pandas 是比較常用的 package

julia> pandasDataframe = Pandas.DataFrame(Dict(:Subjects => ["Maths","Physics","Chemistry"],:Marks => [90,85,95]))
   Marks   Subjects
0     90      Maths
1     85    Physics
2     95  Chemistry


julia> Pandas.describe(pandasDataframe)
       Marks
count    3.0
mean    90.0
std      5.0
min     85.0
25%     87.5
50%     90.0
75%     92.5
max     95.0


julia> pandasDataframe[:Subjects]
0        Maths
1      Physics
2    Chemistry
Name: Subjects, dtype: object


julia> pandasDataframe[:Marks]
0    90
1    85
2    95
Name: Marks, dtype: int64


julia> query(pandasDataframe,:(Marks>90))
   Marks   Subjects
2     95  Chemistry

Distributions

distributions.jl 有以下功能

Moments (ex: mean, variance, skewness, and kurtosis), entropy, and other properties
Probability density/mass functions (pdf) and their logarithms (logpdf)
Moment generating functions and characteristic functions
Maximum likelihood estimation
Posterior w.r.t. conjugate prior and Maximum-A-Posteriori (MAP) estimation

using Pkg
Pkg.add("Distributions")
using Distributions

distribution = Normal()

# 以 normal distribution 產生 10 筆 random 資料
x = rand(distribution, 10)

Binomial()

Cauchy()

Poisson()

TimeSeries

Pkg.add("TimeSeries")
Pkg.add("MarketData")

using TimeSeries

julia> dates  = collect(Date(2017,8,1):Day(1):Date(2017,8,5))
5-element Array{Date,1}:
 2017-08-01
 2017-08-02
 2017-08-03
 2017-08-04
 2017-08-05

julia> sample_time = TimeArray(dates, rand(length(dates)))
5×1 TimeArray{Float64,1,Date,Array{Float64,1}} 2017-08-01 to 2017-08-05
│            │ A      │
├────────────┼────────┤
│ 2017-08-01 │ 0.9142 │
│ 2017-08-02 │ 0.0834 │
│ 2017-08-03 │ 0.417  │
│ 2017-08-04 │ 0.4778 │
│ 2017-08-05 │ 0.6859 │

Hypothesis testing

測試 sample data 是否有足夠的 evidence，針對整個 population of data 進行條件測試。

有兩種 hypothesis testing 測試方式

Null hypothesis: the statement being tested
Alternative hypothesis: the statement that you will be able to conclude as true

Pkg.add("HypothesisTests")
using HypothesisTests

using Distributions

julia> sampleOne = rand(Normal(), 10)
10-element Array{Float64,1}:
 -0.5347603532683008
 -0.7882658160278007
 -0.3314222340035204
 -0.9280782524368908
  0.4540819395751322
  1.056554282551302
  1.2211198338710754
 -0.7067184060685551
  0.9788402691232629
  0.39421862072760827

julia> testOne = OneSampleTTest(sampleOne)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.0815569884043313
    95% confidence interval: (-0.514, 0.6771)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7638

Details:
    number of observations:   10
    t-statistic:              0.3097695342286937
    degrees of freedom:       9
    empirical standard error: 0.2632827937950805


julia> @which OneSampleTTest(sampleOne)
OneSampleTTest(v::AbstractArray{T,1}) where T<:Real in HypothesisTests at /Users/charley/.julia/packages/HypothesisTests/M3Ysg/src/t.jl:98

julia> pvalue(testOne)
0.7637885831202397

julia> pvalue(testOne, tail=:right)
0.38189429156011984

julia> pvalue(testOne, tail=:left)
0.6181057084398802

# check whether 25 successes from 1000 samples is inconsistent with a 50% success rate
julia> BinomialTest(25, 1000, 0.50)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.5
    point estimate:          0.025
    95% confidence interval: (0.0162, 0.0367)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations: 1000
    number of successes:    25

Optimization

optimization 是 finding best solution from all feasible solutions 的問題，可分為兩類

continuous optimization problem
combinatorial optimization problem (discrete)

某些問題可以用 optimization 方法解決

Shortest path
Maximum flow through a network
Vehicle routing

juila 將 optimization packages 集合在 JuliaOpt，其中有兩個最重要的 packages，這兩個都是 AML(Algebraic modeling languages)，放在 MathProgBase.jl 裡面

JuMP (Julia for Mathematical Programming)
Convex.jl

JuMP

Python 習慣使用 PuLP

Pkg.add("JuMP")
Pkg.add("Clp")
using JuMP
using Clp

julia> m = JuMP.Model()
Feasibility problem with:
 * 0 linear constraints
 * 0 variables
Solver is default solver

julia> m = JuMP.Model(solver = Clp.ClpSolver())
Feasibility problem with:
 * 0 linear constraints
 * 0 variables
Solver is Clp

optimiser.jl

using JuMP
using Clp

m = Model(solver = ClpSolver())
@variable(m, 0 <= a <= 2 )
@variable(m, 0 <= b <= 10 )

@objective(m, Max, 5a + 3*b )
@constraint(m, 1a + 5b <= 3.0 )

print(m)

status = solve(m)

println("Objective value: ", getobjectivevalue(m))
println("a = ", getvalue(a))
println("b = ", getvalue(b))

執行

$ julia optimiser.jl
Max 5 a + 3 b
Subject to
 a + 5 b ≤ 3
 0 ≤ a ≤ 2
 0 ≤ b ≤ 10
Objective value: 10.6
a = 2.0
b = 0.2

Convex.jl

用在 disciplined convex programming，使用 Convex 需要 solver，也就是 SCS

Pkg.add("Convex")
Pkg.add("SCS")

using Convex
using SCS

X = Variable(2, 2)

y = Variable()

p = minimize(vecnorm(X) + y, 2 * X <= 1, X' + y >= 1, X >= 0, y >= 0)

solve!(p)

println(X.value)

println(y.value)

p.optval