Time Bucket 과 Gapfilling

문서번호 : 11-252611

Document Information

•

최초 작성일 : 2022.04.11

•

최종 수정일 : 2022.04.11

•

이 문서는 아래 버전을 기준으로 작성되었습니다.

◦

SinglestoreDB : 7.8

Goal

시계열 (Time series) 데이터의 분석을 위해 시간 값을 분석 단위로 정렬하고, 데이터가 존재하지 않아 발생하는 Gap 을 filling하는 방법을 테스트한다.

Solution

테스트를 위한 table 과 sample data를 생성합니다.

crtable.sql

use jnshin;

DROP TABLE IF EXISTS tick;
CREATE TABLE tick
(
  ts datetime(6)
  , code int
  , price int
);


DELIMITER //
DO
BEGIN
  FOR i in 1 .. 500 LOOP
    insert into tick values (
      timestampadd(minute, rand() * 1440 * 2, current_date :> timestamp)
      , 102960
      , rand() * 500 + 5400
    );
  end loop;
  commit;
END;
//
DELIMITER ;

select count(*) from tick;
SQL
복사

singlestore> \. crtable.sql
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.24 sec)

+----------+
| count(*) |
+----------+
|      500 |
+----------+
1 row in set (0.01 sec)
SQL
복사

time_bucket 기능을 이용해 시간 값이 저장된 ts 를 분석단위인 15분 간격으로 변환하는 과정입니다. 만약 시간 시작을 특정 시각에서 시작하길 원한다면, ts_bk5 처럼 bucket의 시작시점을 지정할 수 있습니다. 지정된 bucket 시작 이전 값은 이전 bucket으로 변환되는 것을 볼 수 있습니다.

select ts 
  , time_bucket("15m", ts) ts_bk0
  , time_bucket("15m", ts, timestampadd(minute, 5, current_date :> timestamp) ) ts_bk5
  , price
 from tick
 order by ts
 limit 10;
SQL
복사

+----------------------------+----------------------------+----------------------------+-------+
| ts                         | ts_bk0                     | ts_bk5                     | price |
+----------------------------+----------------------------+----------------------------+-------+
| 2022-04-11 00:04:00.000000 | 2022-04-11 00:00:00.000000 | 2022-04-10 23:50:00.000000 |  5477 |
| 2022-04-11 00:05:00.000000 | 2022-04-11 00:00:00.000000 | 2022-04-11 00:05:00.000000 |  5652 |
| 2022-04-11 00:16:00.000000 | 2022-04-11 00:15:00.000000 | 2022-04-11 00:05:00.000000 |  5816 |
| 2022-04-11 00:18:00.000000 | 2022-04-11 00:15:00.000000 | 2022-04-11 00:05:00.000000 |  5527 |
| 2022-04-11 00:19:00.000000 | 2022-04-11 00:15:00.000000 | 2022-04-11 00:05:00.000000 |  5415 |
| 2022-04-11 00:20:00.000000 | 2022-04-11 00:15:00.000000 | 2022-04-11 00:20:00.000000 |  5736 |
| 2022-04-11 00:20:00.000000 | 2022-04-11 00:15:00.000000 | 2022-04-11 00:20:00.000000 |  5733 |
| 2022-04-11 00:30:00.000000 | 2022-04-11 00:30:00.000000 | 2022-04-11 00:20:00.000000 |  5590 |
| 2022-04-11 00:39:00.000000 | 2022-04-11 00:30:00.000000 | 2022-04-11 00:35:00.000000 |  5574 |
| 2022-04-11 00:41:00.000000 | 2022-04-11 00:30:00.000000 | 2022-04-11 00:35:00.000000 |  5712 |
+----------------------------+----------------------------+----------------------------+-------+
SQL
복사

이처럼 분석 단위로 보정된 시간 값을 이용해 smooth 또는 분석을 위한 windows function 을 사용합니다.

select 
  time_bucket("15m", ts) as ts_bk0
  , avg(price) over w
  , count(1) over w
 from tick 
 group by 1
 window w as (partition by time_bucket("15m", ts) order by ts)
 order by ts
 limit 10;
SQL
복사

+----------------------------+-------------------+-----------------+
| ts_bk0                     | avg(price) over w | count(1) over w |
+----------------------------+-------------------+-----------------+
| 2022-04-11 00:00:00.000000 |         5652.0000 |               1 |
| 2022-04-11 00:15:00.000000 |         5816.0000 |               1 |
| 2022-04-11 00:30:00.000000 |         5574.0000 |               1 |
| 2022-04-11 00:45:00.000000 |         5494.0000 |               1 |
| 2022-04-11 01:00:00.000000 |         5827.0000 |               1 |
| 2022-04-11 01:15:00.000000 |         5886.0000 |               1 |
| 2022-04-11 01:45:00.000000 |         5552.0000 |               1 |
| 2022-04-11 02:00:00.000000 |         5759.0000 |               1 |
| 2022-04-11 02:15:00.000000 |         5445.0000 |               1 |
| 2022-04-11 02:45:00.000000 |         5586.0000 |               1 |
+----------------------------+-------------------+-----------------+
SQL
복사

위 결과를 살펴 보면, 각 bucket 별로 수집된 data가 1건 씩이라는 것을 확인할 수 있습니다. 그렇지만, 수집되지 않은 ‘01:30’, ‘02:30’ 은 자료가 조회되지 않았습니다. 이렇게 시계열 값이 누락된 Gap을 채워주는 과정을 Gap filling이라합니다. 다음과 같이 간단히 시계열 값을 생성해 활용합니다.

gapfilling.sql

use jnshin;


with int_seq as (
  select a.table_col * 10 + b.table_col seq
  FROM table([0,1,2,3,4,5,6,7,8,9]) a
     , table([0,1,2,3,4,5,6,7,8,9]) b
  order by seq
  limit 96
)
, time_series as (
  select seq
    , adddate(current_date :> timestamp, interval p.seq*15 MINUTE) ts
  from int_seq p
)
, bucketed_value as (
  select time_bucket("15m", ts) ts
    , count(*) data_count
    , avg(price)  over w as price_avg
  from tick
  group by 1
  window w as (partition by time_bucket("15m", ts) order by ts)
)
select a.ts, nvl(b.data_count, 0), nvl(price_avg, 0)
from time_series a left outer join bucketed_value b
     on a.ts = b.ts
group by a.ts
order by a.ts
limit 14;
SQL
복사

+---------------------+----------------------+-------------------+
| ts                  | nvl(b.data_count, 0) | nvl(price_avg, 0) |
+---------------------+----------------------+-------------------+
| 2022-04-11 00:00:00 |                    2 |         5477.0000 |
| 2022-04-11 00:15:00 |                    5 |         5527.0000 |
| 2022-04-11 00:30:00 |                    3 |         5590.0000 |
| 2022-04-11 00:45:00 |                    7 |         5525.0000 |
| 2022-04-11 01:00:00 |                    4 |         5827.0000 |
| 2022-04-11 01:15:00 |                    4 |         5886.0000 |
| 2022-04-11 01:30:00 |                    0 |            0.0000 |
| 2022-04-11 01:45:00 |                    3 |         5849.0000 |
| 2022-04-11 02:00:00 |                    1 |         5759.0000 |
| 2022-04-11 02:15:00 |                    1 |         5445.0000 |
| 2022-04-11 02:30:00 |                    0 |            0.0000 |
| 2022-04-11 02:45:00 |                    3 |         5811.0000 |
| 2022-04-11 03:00:00 |                    1 |         5635.0000 |
| 2022-04-11 03:15:00 |                    0 |            0.0000 |
+---------------------+----------------------+-------------------+
SQL
복사

Gap filling을 수행하기 전에 누락되었던 값들이 조회됩니다.

비전하게 Gap filling을 실행하는 환경이라면, CTE (Common Table Expression) 으로 정의한 int_seq 또는 time_series 를 table 로 정의할 수 있다.

References

TIME_BUCKET - v7.8

With (Common Table Expression) - v7.8

History

일자	작성자	비고
2022-04-11	jnshin	최초작성

A-platform