These are my notes on getting the hivemall logistic regression sample up and running.

Setting up Hadoop and hive
Downloading the sample dataset and hivemall
Running the sample (logistic regression)

The sample I tried:
https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)

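Setting up Hadoop and hive

This is the same as the "Setting up Hadoop and hive" section of the entry below, so I will skip it here.
Note that those steps are written for Mac OS X.

Setting up hive on a Mac and accessing files on S3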
http://takemikami.com/2016/03/31/MachiveS3.html

Downloading the sample dataset and hivemall

Downloading hivemall

Download the hivemall jar from the following URL.

URL: https://github.com/myui/hivemall/releases

File: hivemall-0.3.2-3-with-dependencies.jar

Note: the 0.4 series did not work for me, so I use the 0.3 series here.

Downloading the sample dataset

Download the sample dataset from the following URL.

URL: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a

Files:

a9a
a9a.t

The datasets are described here:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Running the sample (logistic regression)

I run the logistic regression sample by following the instructions at the URL below.

URL: https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)

Loading the dataset into tables

Load the sample dataset into the tables a9atrain and a9atest as follows. The downloaded dataset files are assumed to be in /tmp.

a9atrain

hive> create table a9atrain_original(text string);
hive> load data local inpath '/tmp/a9a' into table a9atrain_original;
hive> create table a9atrain as select
 regexp_replace(reflect('java.util.UUID','randomUUID'), '-', '') rowid,
 case when substr(text,1,2) >0 then 1 else 0 end label,
 split(trim(substr(text,4)),' ') features
from a9atrain_original;
hive> drop table a9atrain_original;

a9atest

hive> create table a9atest_original(text string);
hive> load data local inpath '/tmp/a9a.t' into table a9atest_original;
hive> create table a9atest as select
 regexp_replace(reflect('java.util.UUID','randomUUID'), '-', '') rowid,
 case when substr(text,1,2) >0 then 1 else 0 end label,
 split(trim(substr(text,4)),' ') features
from a9atest_original;
hive> drop table a9atest_original;

Let's look at the loaded dataset with a select in hive.

hive> select * from a9atrain limit 5;
4fd4b9e9b46a4638b59ccf5122c062f7	0	["3:1","11:1","14:1","19:1","39:1","42:1","55:1","64:1","67:1","73:1","75:1","76:1","80:1","83:1"]
da1230820aa744d690898c8b42fc03c0	0	["5:1","7:1","14:1","19:1","39:1","40:1","51:1","63:1","67:1","73:1","74:1","76:1","78:1","83:1"]
f85b1daf18534c2982d88c587b854458	0	["3:1","6:1","17:1","22:1","36:1","41:1","53:1","64:1","67:1","73:1","74:1","76:1","80:1","83:1"]
b59be913a0024a289d63c0e794b9c3c9	0	["5:1","6:1","17:1","21:1","35:1","40:1","53:1","63:1","71:1","73:1","74:1","76:1","80:1","83:1"]
b7ff8b9f38ae4d889739a1d1f8721cd1	0	["2:1","6:1","18:1","19:1","39:1","40:1","52:1","61:1","71:1","72:1","74:1","76:1","80:1","95:1"]

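The dataset consists of the following columns.

Column 1: rowid     a unique ID for each record
Column 2: label     the value to be predicted (0 or 1)
Column 3: features  an array of features (pairs of feature ID:feature value)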
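
To make the substr/split logic of the create table query easier to follow, here is a minimal Python sketch of the same transformation applied to a single raw line. The function name parse_a9a_line and the sample line are hypothetical, and the sketch assumes each raw line starts with "+1" or "-1" followed by a space, as in the a9a files.

```python
import uuid


def parse_a9a_line(text):
    """Turn one raw a9a line into (rowid, label, features), mirroring the Hive query."""
    # regexp_replace(reflect('java.util.UUID','randomUUID'), '-', ''):
    # a random UUID with the hyphens removed (32 hex characters)
    rowid = uuid.uuid4().hex
    # substr(text,1,2) > 0: the first two characters are "+1" or "-1"
    label = 1 if int(text[:2]) > 0 else 0
    # substr(text,4): everything from the 4th character on, trimmed and split on spaces
    features = text[3:].strip().split(" ")
    return rowid, label, features


# Hypothetical sample line in the same "label id:value ..." layout
rowid, label, features = parse_a9a_line("-1 3:1 11:1 14:1 19:1 ")
print(rowid, label, features)
```

The label "-1" becomes 0 and "+1" becomes 1, which is why the label column above only contains 0 or 1.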

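Checking the number of records in the dataset

Count the records in each table and store the counts in hive variables as follows. total_steps is passed to logress during training, and num_test_instances is used when computing the accuracy.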

hive> select count(1) from a9atrain;
hive> set hivevar:total_steps=32561;

hive> select count(1) from a9atest;
hive> set hivevar:num_test_instances=16281;

Building a model from the training data (training)

Register the hivemall UDFs as follows.
Note: the downloaded jar file is assumed to be in /tmp.

hive> add jar /tmp/hivemall-0.3.2-3-with-dependencies.jar;
hive> create temporary function addBias as 'hivemall.ftvec.AddBiasUDF';
hive> create temporary function logress as 'hivemall.regression.LogressUDTF';

Train the model with the following query.

hive> create table a9a_model1
as
select
 cast(feature as int) as feature,
 avg(weight) as weight
from
 (select
     logress(addBias(features),label,"-total_steps ${total_steps}") as (feature,weight)
  from
     a9atrain
 ) t
group by feature;

Let's look at the resulting model with a select in hive.

hive> select * from a9a_model1 limit 5;
0	-0.5761121511459351
1	-1.5259535312652588
10	0.21053194999694824
100	-0.017715860158205032
101	0.007558753248304129

The model table consists of the following columns.

Column 1: feature  the feature ID
Column 2: weight   the weight (coefficient) of the feature
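
The outer part of the training query groups the (feature, weight) rows emitted by logress and averages the weights per feature. As far as I understand, the same feature can appear more than once because each parallel task emits its own weights. Below is a minimal Python sketch of just that averaging step. The function name average_weights and the input rows are made up for illustration and are not taken from the run above.

```python
from collections import defaultdict


def average_weights(rows):
    """Average weights per feature, like avg(weight) ... group by feature."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in rows:
        key = int(feature)  # cast(feature as int)
        sums[key] += weight
        counts[key] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Hypothetical (feature, weight) rows, with feature "0" emitted twice
emitted = [("0", -0.5), ("0", -0.7), ("10", 0.25)]
model = average_weights(emitted)
print(model)
```

The resulting dictionary plays the role of the a9a_model1 table: one weight per feature ID.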

Predicting on the test data (prediction)

Register the hivemall UDFs used for prediction as follows.
Note: this assumes the jar in /tmp was already added in the training step above.

hive> create temporary function extract_feature as 'hivemall.ftvec.ExtractFeatureUDF';
hive> create temporary function extract_weight as 'hivemall.ftvec.ExtractWeightUDF';
hive> create temporary function sigmoid as 'hivemall.tools.math.SigmodUDF';

Run the prediction with the following query.

hive> create or replace view a9a_predict1
as
WITH a9atest_exploded as (
select
  rowid,
  label,
  extract_feature(feature) as feature,
  extract_weight(feature) as value
from
  a9atest LATERAL VIEW explode(addBias(features)) t AS feature
)
select
  t.rowid,
  sigmoid(sum(m.weight * t.value)) as prob,
  CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
from
  a9atest_exploded t LEFT OUTER JOIN
  a9a_model1 m ON (t.feature = m.feature)
group by
  t.rowid;

Let's look at the predictions with a select in hive.

hive> select * from a9a_predict1 limit 5;
0000c75d50fc4db093ee0aa663d19266	0.45304257	0.0
00033d759e20486887ae50639fcd03c0	0.17149617	0.0
000592244c6a4e669f5fcaf394e55807	0.068347506	0.0
000cfdad24f241ffbef9d2ece5a909a2	0.4040777	0.0
00198661df1c43c79bfcb4c879b0e82b	0.048144594	0.0

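The prediction result consists of the following columns.

Column 1: rowid  a unique ID for each record
Column 2: prob   the predicted probability
Column 3: label  the predicted label (0 or 1)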
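
For a single test record, the prediction query boils down to summing weight * value over the record's features, applying the sigmoid, and thresholding at 0.5. Here is a minimal Python sketch of that calculation. The function name predict and the toy model are hypothetical, and I am assuming that addBias appends a constant bias feature "0:1.0" (the model table above does contain a feature 0).

```python
import math


def predict(features, model, bias_feature="0:1.0"):
    """Return (prob, label) for one record, given a {feature_id: weight} model."""
    total = 0.0
    # addBias(features): assumed here to append a constant bias feature
    for item in features + [bias_feature]:
        fid, value = item.split(":")
        weight = model.get(int(fid))
        if weight is None:
            # LEFT OUTER JOIN: features missing from the model contribute nothing
            continue
        total += weight * float(value)
    prob = 1.0 / (1.0 + math.exp(-total))  # sigmoid
    label = 1.0 if prob >= 0.5 else 0.0
    return prob, label


# Toy model with made-up weights; feature 99 is not in the model
toy_model = {0: -0.5, 3: 1.0}
prob, label = predict(["3:1", "99:1"], toy_model)
print(prob, label)
```

In this toy case the sum is 1.0 - 0.5 = 0.5, so the probability lands a little above 0.5 and the predicted label is 1.0.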

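Evaluating the predictions

The following queries evaluate the predictions by computing the fraction of test records whose predicted label matches the actual label.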

hive> create or replace view a9a_submit1 as
select
  t.label as actual,
  pd.label as predicted,
  pd.prob as probability
from
  a9atest t JOIN a9a_predict1 pd
    on (t.rowid = pd.rowid);
hive> select count(1) / ${num_test_instances} from a9a_submit1
where actual == predicted;
0.8430071862907684
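
The evaluation query is a plain accuracy calculation: count the rows where actual equals predicted and divide by the number of test records. A minimal Python sketch of the same arithmetic follows. The function name accuracy and the four (actual, predicted) pairs are invented for illustration only.

```python
def accuracy(pairs, num_test_instances):
    """Fraction of records whose predicted label equals the actual label."""
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    return correct / num_test_instances


# Hypothetical (actual, predicted) pairs; three of the four match
pairs = [(0, 0.0), (1, 0.0), (1, 1.0), (0, 0.0)]
acc = accuracy(pairs, len(pairs))
print(acc)
```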

That's all. I only followed the tutorial step by step, though.