Tuesday, November 13, 2018

PostgreSQL: creating OHLC or "candlestick" output from quote/tick raw material

As a quick note to self:

(jdbc/with-db-connection [conn -db-conn-]
  (let [t (time.coerce/from-string "2018-02-01")]
    (time
     (dotimes [i 50]
       (let [ts-start (time/plus t (time/hours i))
             ts-end (time/plus t (time/hours (inc i)))]
         (time (println {:data (jdbc/query conn ["SELECT
        first_value(bid_price) OVER w AS first_bid_price,
        first_value(ask_price) OVER w AS first_ask_price,
        MAX(bid_price) OVER w AS max_bid_price,
        MAX(ask_price) OVER w AS max_ask_price,
        MIN(bid_price) OVER w AS min_bid_price,
        MIN(ask_price) OVER w AS min_ask_price,
        last_value(bid_price) OVER w AS last_bid_price,
        last_value(ask_price) OVER w AS last_ask_price
  FROM bitmex_quote
  WHERE timestamp >= ? AND timestamp < ? AND symbol = 'XBTUSD'
  WINDOW w AS (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  ORDER BY timestamp ASC
  LIMIT 1;"
                                                 (time.coerce/to-sql-time ts-start) (time.coerce/to-sql-time ts-end)])
                         :timestamp (str ts-end)})))))))


...for this to perform reasonably you'll need to do indexing.

Monday, November 12, 2018

Notes on database stuff: PostgreSQL indexing performance

It might be a very good idea to take a look at e.g. https://www.timescale.com/

Total number of rows (quote data only): 749 640 222

Querying for a single 1 hour "candle" (sort of..) without index: 5 minutes. In other words; completely useless.

..doing the same with index: around 200ms or 1500% (15x) faster.  UPDATE: I fiddled around and ended up with these numbers: https://gist.github.com/lnostdal/9e7e710290cd7448cd81b29fc744fbc7 ..this should be faster than my backtester at the moment; at least for non-trivial stuff.

The query I used to test this:

SELECT MAX(bid_price) AS max_bid_price, MIN(bid_price) AS min_bid_price, MAX(ask_price) AS max_ask_price, MIN(ask_price) AS min_ask_price 
FROM bitmex_quote
WHERE timestamp >= '2018-07-30 00:00:00' AND timestamp < '2018-07-30 01:00:00' AND symbol = 'XBTUSD';

I'm sure there are ways of organizing and indexing this kind of data which are better and faster than what I've done, but this is an OK start for now.

Building a partial index takes about 30 minutes for each `symbol`. I'm sure all of this would be much faster on proper hardware and finalized configuration of everything; the point are the nice % speedups.

Here's how I built the index:

CREATE INDEX idx_timestamp_xbtusd
    ON public.bitmex_quote
USING btree
    ("timestamp" ASC NULLS LAST)
    TABLESPACE pg_default
WHERE symbol = 'XBTUSD';

I create one index like this for each `symbol`. These are partial indexes which I think makes sense for this.

https://www.postgresql.org/docs/11/rules-materializedviews.html is probably a nice way to build very fast viewports of the most common timeframes; 1min, 3min, 5min, 15min, 30min, 1h, 2h and so on.

Friday, November 2, 2018

Notes on Clojure: memory leaks, performance and more

work in progress; i'll add more to this later

Memory leaks

  1. https://www.eclipse.org/mat/ is great and works with Clojure.
  2. When debugging, setting -Dclojure.compiler.disable-locals-clearing=true can be nice, but make sure you set this back to false (the default) because it can (will?) create memory leaks in non-trivial code.
  3. In some cases you should use eduction instead of sequence when dealing with transducers and huge amounts of data.
  4. Not related to Clojure, but Emacs actually has some leak problems lately: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26952 so make sure you run a recent Emacs. This is not a problem that bit me often though.
Spent the last 2 days debugging memory. Turns out 90% of the problem was a wrong JVM flag (#1 above) -- and the remaining 10% of the problem was an issue with clojure.core.async which I'll try to document later. 

Performance

  1. -Dclojure.compiler.direct-linking=true leads to massive performance improvements in some cases. You'll need to change how you do interactive development after setting this flag tho. I also find return value type hints work better when this is true -- which can help with debugging.
  2. Make sure you set *warn-on-reflection* to true and *unchecked-math* to :warn-on-boxed. Dealing with this correctly will probably give you the biggest performance of everything else mentioned here.
  3. It is very important to not use Fns like FIRST on ArrayLists, ArrayDeques and similar. As it will actually generate a temporary object via an internal call to SEQ on every invocation(!). This might be obvious, but it's easy to forget because a simple call to FIRST seems like something that "should" be cheap to do. Stick with the JVM specific methods and interfaces for these classes; get, peekFirst, peekLast etc..
  4. Also, Vector implements IFn which means you can do (my-vector 0) instead of (first my-vector) for a 4-5x win.
  5. Always use PEEK instead of LAST when you can.
  6. This looks useful: https://github.com/clojure-goes-fast
 ..I'll add more tips about performance in future posts, here: https://quantoga.blogspot.com/search/label/performance