julia 提供處理大量資料,以及進行統計運算的 library
Working with data
首先是讀取資料的方式,如果是 binary 資料,可直接用 read(), write()
julia> for val in [:Bool, :Char, :Int8]
@eval println(@which read(stdin, $val))
end
read(s::IO, ::Type{Bool}) in Base at io.jl:593
read(io::IO, ::Type{Char}) in Base at io.jl:613
read(s::IO, ::Type{Int8}) in Base at io.jl:588
使用
julia> read(stdin, Char)
j
'j': ASCII/Unicode U+006a (category Ll: Letter, lowercase)
julia> read(stdin, Bool)
true
true
julia> read(stdin, Int8)
12
49
readline
julia> methods(readline)
# 4 methods for generic function "readline":
[1] readline(s::IOStream; keep) in Base at iostream.jl:433
[2] readline() in Base at io.jl:370
[3] readline(filename::AbstractString; keep) in Base at io.jl:364
[4] readline(s::IO; keep) in Base at io.jl:370
使用
julia> number = parse(Int64, readline())
12
12
julia> println(number)
12
working with text files
open 參數除了 filename,後面還有存取該檔案的 mode
julia> methods(open)
# 8 methods for generic function "open":
[1] open(fname::AbstractString; read, write, create, truncate, append) in Base at iostream.jl:275
[2] open(fname::AbstractString, mode::AbstractString) in Base at iostream.jl:339
[3] open(f::Function, cmds::Base.AbstractCmd, args...) in Base at process.jl:615
[4] open(f::Function, args...; kwargs...) in Base at iostream.jl:367
[5] open(cmds::Base.AbstractCmd) in Base at process.jl:584
[6] open(cmds::Base.AbstractCmd, mode::AbstractString) in Base at process.jl:562
[7] open(cmds::Base.AbstractCmd, mode::AbstractString, other::Union{RawFD, FileRedirect, IO}) in Base at process.jl:562
[8] open(cmds::Base.AbstractCmd, other::Union{RawFD, FileRedirect, IO}; write, read) in Base at process.jl:584
Mode | Description |
---|---|
r | read |
r+ | read, write |
w | write, create, truncate |
w+ | read, write, create, truncate |
a | write, create, append |
a+ | read, write, create, append |
# 有個文字檔 sample.txt
shell> cat sample.txt
file = open("sample.txt")
# 以行為單位,讀入文字檔
file_data = readlines(file)
enumerate(file_data)
# 加上行號
for lines in enumerate(file_data)
println(lines[1],"-> ", lines[2])
end
# 大寫列印
for line in file_data
println(uppercase(line))
end
# 反過來列印
for line in file_data
println(reverse(line))
end
# 計算行數
countlines("sample.txt")
# 列印第一行
first(file_data)
# 最後一行
last(file_data)
working with CSV
using DelimitedFiles
csvfile = readdlm("sample.csv", ',', String, '\n')
# 只取得第二欄資料
csvfile[:,2]
# 只取前三行資料
csvfile[1:3,:]
# 以 | 區隔資料的 sample.psv
# "Lora"|"UK"
# "James"|"UK"
readdlm("sample.psv",'|')
working with DataFrames
DataFrame 是一種類似 database 的資料結構,可以用類似 SQL 的方式存取 DataFrames
using DataFrames
# 取得所有 DataFrames 支援的 functions
names(DataFrames)
DataFrame: 2D data structure,類似 R 與 Pandas 的 DataFrame
- DataArray: 比 julia 預設 Array type 提供更多 comparison 的功能,但目前已經 deprecated (ref: https://github.com/JuliaStats/DataArrays.jl)
- DataFrame: 2D data structure,類似 R 與 Pandas 的 DataFrame
julia 的 Array 以 nothing 代表 missing value
julia> a = [1,2,3,nothing,5,6]
6-element Array{Union{Nothing, Int64},1}:
1
2
3
nothing
5
6
julia> typeof(a[4])
Nothing
DataFrame: https://juliadata.github.io/DataFrames.jl/stable/man/getting_started.html
julia> dframe = DataFrame(Names = ["John","May"], Age = [27,28])
2×2 DataFrame
│ Row │ Names │ Age │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ John │ 27 │
│ 2 │ May │ 28 │
julia> dframe.Names
2-element Array{String,1}:
"John"
"May"
julia> dframe.Age
2-element Array{Int64,1}:
27
28
julia> names(dframe)
2-element Array{Symbol,1}:
:Names
:Age
julia> describe(dframe)
2×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼────────┼──────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1 │ Names │ │ John │ │ May │ 2 │ │ String │
│ 2 │ Age │ 27.5 │ 27 │ 27.5 │ 28 │ │ │ Int64 │
另外有個獨立的 CSV 套件,可處理 csv file,讀入 CSV file 後,就得到 DataFrame 物件
using Pkg
Pkg.add("CSV")
using CSV
julia> CSV.read("sample.csv")
3×2 DataFrame
│ Row │ Lora │ UK │
│ │ String⍰ │ String⍰ │
├─────┼─────────┼─────────┤
│ 1 │ James │ UK │
│ 2 │ Raj │ India │
│ 3 │ May │ America │
Linear algebra
產生 matrix
# 亂數 3x3 matrix
julia> A = rand(3,3)
3×3 Array{Float64,2}:
0.108282 0.433033 0.247145
0.571936 0.369386 0.547845
0.423382 0.380503 0.77661
# array 且初始為 1
julia> ones(5)
5-element Array{Float64,1}:
1.0
1.0
1.0
1.0
1.0
Vector 與 Matrix 的差異
Vector{T} 是 Array{T, 1} 的 alias,Matrix{T} 是Array{T, 2} 的 alias
julia> Vector{Float64} == Array{Float64,1}
true
julia> Matrix{Float64} == Array{Float64,2}
true
矩陣乘法
julia> A = rand(3,3)
3×3 Array{Float64,2}:
0.167005 0.188954 0.0483164
0.82603 0.440255 0.00107621
0.137211 0.308508 0.724953
julia> b = 2*A
3×3 Array{Float64,2}:
0.334011 0.377909 0.0966328
1.65206 0.88051 0.00215241
0.274422 0.617015 1.44991
julia> b= 2A
3×3 Array{Float64,2}:
0.334011 0.377909 0.0966328
1.65206 0.88051 0.00215241
0.274422 0.617015 1.44991
transpose 轉置矩陣
julia> transpose_of_A = A'
3×3 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
0.167005 0.82603 0.137211
0.188954 0.440255 0.308508
0.0483164 0.00107621 0.724953
inverse 逆矩陣
julia> inv(A)
3×3 Array{Float64,2}:
-6.31559 2.41816 0.417329
11.8591 -2.26692 -0.787013
-3.85134 0.507016 1.63533
Determinant 行列式
julia> using LinearAlgebra
julia> det(A)
-0.05048337143385562
Eigenvalues 特徵向量
julia> eigvals(A)
3-element Array{Float64,1}:
-0.1016764648907107
0.5846559789763167
0.8492342814186749
方程式運算
julia> 5
5
julia> equation = 3x^2 + 4x + 3
98
Differential calculus
Pkg.add("Calculus")
using Calculus
names(Calculus)
julia> f(x) = sin(x)
f (generic function with 1 method)
julia> f'(1.0)
0.5403023058631036
julia> f'(1.0) - cos(1.0)
-5.036193684304635e-12
julia> f''(1.0) - (-sin(1.0))
-6.647716624952338e-7
julia> differentiate("cos(x) + sin(x) + exp(-x) * cos(x)", :x)
:(1 * -(sin(x)) + 1 * cos(x) + ((-1 * exp(-x)) * cos(x) + exp(-x) * (1 * -(sin(x)))))
julia> differentiate("sin(x)", :x)
:(1 * cos(x))
julia> differentiate("cos(x) + sin(y) + exp(-x) * cos(y)", [:x, :y])
2-element Array{Any,1}:
:(1 * -(sin(x)) + ((-1 * exp(-x)) * cos(y) + exp(-x) * 0))
:(1 * cos(y) + (0 * cos(y) + exp(-x) * (1 * -(sin(y)))))
Statistics
JuliaStats 所列出支援的 packages
- DataFrames
- Distributions: probability distribution
- MultivariateStats: Multivariate statistical analysis
- HypothesisTests
- MLBase: Swiss knife for machine learning
- Distances: Various distances between vectors
- KernelDensity
- Clustering: algorithm for data clustering
- GLM: Generalized linear models
- NMF: Nonnegative matrix factorization
- RegERMs: Regularized empirical risk minimization
- MCMC: Markov Chain Monte Carlo
- TimeSeries: time series analysis
simple statistics
julia> using Statistics
julia> x = [10,20,30,40,50]
julia> mean(x)
30.0
# 中位數 median
julia> median(x)
30.0
julia> sum(x)
150
# 標準差
julia> std(x)
15.811388300841896
# variance 變異數
julia> var(x)
250.0
julia 也支援 cumulative operations 的 functions
- accumulate
- cumsum(): cumulative summation
- cumprod(): cumulative product
julia> accumulate(+, x)
5-element Array{Int64,1}:
10
30
60
100
150
julia> accumulate(-, x)
5-element Array{Int64,1}:
10
-10
-40
-80
-130
julia> cumsum(x)
5-element Array{Int64,1}:
10
30
60
100
150
julia> cumprod(x)
5-element Array{Int64,1}:
10
200
6000
240000
12000000
julia> for i in [:cumsum,:cumprod]
@eval print($i, "->")
@eval println($i(x))
end
cumsum->[10, 30, 60, 100, 150]
cumprod->[10, 200, 6000, 240000, 12000000]
statistics using DataFrames
julia> using DataFrames
julia> dframe = DataFrame(Subjects = ["Maths","Physics","Chemistry"],Marks = [90,85,95])
3×2 DataFrame
│ Row │ Subjects │ Marks │
│ │ String │ Int64 │
├─────┼───────────┼───────┤
│ 1 │ Maths │ 90 │
│ 2 │ Physics │ 85 │
│ 3 │ Chemistry │ 95 │
julia> describe(dframe)
2×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼────────┼───────────┼────────┼─────────┼─────────┼──────────┼──────────┤
│ 1 │ Subjects │ │ Chemistry │ │ Physics │ 3 │ │ String │
│ 2 │ Marks │ 90.0 │ 85 │ 90.0 │ 95 │ │ │ Int64 │
Pandas
對習慣使用 python 進行資料分析的使用者來說,Pandas 是比較常用的 package
julia> pandasDataframe = Pandas.DataFrame(Dict(:Subjects => ["Maths","Physics","Chemistry"],:Marks => [90,85,95]))
Marks Subjects
0 90 Maths
1 85 Physics
2 95 Chemistry
julia> Pandas.describe(pandasDataframe)
Marks
count 3.0
mean 90.0
std 5.0
min 85.0
25% 87.5
50% 90.0
75% 92.5
max 95.0
julia> pandasDataframe[:Subjects]
0 Maths
1 Physics
2 Chemistry
Name: Subjects, dtype: object
julia> pandasDataframe[:Marks]
0 90
1 85
2 95
Name: Marks, dtype: int64
julia> query(pandasDataframe,:(Marks>90))
Marks Subjects
2 95 Chemistry
Distributions
distributions.jl 有以下功能
- Moments (ex: mean, variance, skewness, and kurtosis), entropy, and other properties
- Probability density/mass functions (pdf) and their logarithms (logpdf)
- Moment generating functions and characteristic functions
- Maximum likelihood estimation
- Posterior w.r.t. conjugate prior and Maximum-A-Posteriori (MAP) estimation
using Pkg
Pkg.add("Distributions")
using Distributions
distribution = Normal()
# 以 normal distribution 產生 10 筆 random 資料
x = rand(distribution, 10)
Binomial()
Cauchy()
Poisson()
TimeSeries
Pkg.add("TimeSeries")
Pkg.add("MarketData")
using TimeSeries
julia> dates = collect(Date(2017,8,1):Day(1):Date(2017,8,5))
5-element Array{Date,1}:
2017-08-01
2017-08-02
2017-08-03
2017-08-04
2017-08-05
julia> sample_time = TimeArray(dates, rand(length(dates)))
5×1 TimeArray{Float64,1,Date,Array{Float64,1}} 2017-08-01 to 2017-08-05
│ │ A │
├────────────┼────────┤
│ 2017-08-01 │ 0.9142 │
│ 2017-08-02 │ 0.0834 │
│ 2017-08-03 │ 0.417 │
│ 2017-08-04 │ 0.4778 │
│ 2017-08-05 │ 0.6859 │
Hypothesis testing
測試 sample data 是否有足夠的 evidence,針對整個 population of data 進行條件測試。
有兩種 hypothesis testing 測試方式
- Null hypothesis: the statement being tested
- Alternative hypothesis: the statement that you will be able to conclude as true
Pkg.add("HypothesisTests")
using HypothesisTests
using Distributions
julia> sampleOne = rand(Normal(), 10)
10-element Array{Float64,1}:
-0.5347603532683008
-0.7882658160278007
-0.3314222340035204
-0.9280782524368908
0.4540819395751322
1.056554282551302
1.2211198338710754
-0.7067184060685551
0.9788402691232629
0.39421862072760827
julia> testOne = OneSampleTTest(sampleOne)
One sample t-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 0
point estimate: 0.0815569884043313
95% confidence interval: (-0.514, 0.6771)
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.7638
Details:
number of observations: 10
t-statistic: 0.3097695342286937
degrees of freedom: 9
empirical standard error: 0.2632827937950805
julia> @which OneSampleTTest(sampleOne)
OneSampleTTest(v::AbstractArray{T,1}) where T<:Real in HypothesisTests at /Users/charley/.julia/packages/HypothesisTests/M3Ysg/src/t.jl:98
julia> pvalue(testOne)
0.7637885831202397
julia> pvalue(testOne, tail=:right)
0.38189429156011984
julia> pvalue(testOne, tail=:left)
0.6181057084398802
# check whether 25 successes from 1000 samples is inconsistent with a 50% success rate
julia> BinomialTest(25, 1000, 0.50)
Binomial test
-------------
Population details:
parameter of interest: Probability of success
value under h_0: 0.5
point estimate: 0.025
95% confidence interval: (0.0162, 0.0367)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-99
Details:
number of observations: 1000
number of successes: 25
Optimization
optimization 是 finding best solution from all feasible solutions 的問題,可分為兩類
- continuous optimization problem
- combinatorial optimization problem (discrete)
某些問題可以用 optimization 方法解決
- Shortest path
- Maximum flow through a network
- Vehicle routing
juila 將 optimization packages 集合在 JuliaOpt,其中有兩個最重要的 packages,這兩個都是 AML(Algebraic modeling languages),放在 MathProgBase.jl 裡面
- JuMP (Julia for Mathematical Programming)
- Convex.jl
JuMP
Python 習慣使用 PuLP
Pkg.add("JuMP")
Pkg.add("Clp")
using JuMP
using Clp
julia> m = JuMP.Model()
Feasibility problem with:
* 0 linear constraints
* 0 variables
Solver is default solver
julia> m = JuMP.Model(solver = Clp.ClpSolver())
Feasibility problem with:
* 0 linear constraints
* 0 variables
Solver is Clp
optimiser.jl
using JuMP
using Clp
m = Model(solver = ClpSolver())
@variable(m, 0 <= a <= 2 )
@variable(m, 0 <= b <= 10 )
@objective(m, Max, 5a + 3*b )
@constraint(m, 1a + 5b <= 3.0 )
print(m)
status = solve(m)
println("Objective value: ", getobjectivevalue(m))
println("a = ", getvalue(a))
println("b = ", getvalue(b))
執行
$ julia optimiser.jl
Max 5 a + 3 b
Subject to
a + 5 b ≤ 3
0 ≤ a ≤ 2
0 ≤ b ≤ 10
Objective value: 10.6
a = 2.0
b = 0.2
Convex.jl
用在 disciplined convex programming,使用 Convex 需要 solver,也就是 SCS
Pkg.add("Convex")
Pkg.add("SCS")
using Convex
using SCS
X = Variable(2, 2)
y = Variable()
p = minimize(vecnorm(X) + y, 2 * X <= 1, X' + y >= 1, X >= 0, y >= 0)
solve!(p)
println(X.value)
println(y.value)
p.optval
沒有留言:
張貼留言