Time Series Forecasting (3): GRU — Lightweight Gates and Efficiency Trade-offs

Tue, 01 Oct 2024 09:00:00 +0000

After you’ve used LSTM for a while, an obvious question shows up: aren’t three gates a bit much? The forget and input gates seem to do related work — one decides what to drop, the other decides what to add — couldn’t they be merged? And does the cell state really need to be a separate vector from the hidden state, or could the hidden state do double duty?

That is the question Cho et al. answered in 2014 with the Gated Recurrent Unit (GRU). They collapsed three gates into two: an update gate that controls how much of the old state to keep versus how much new content to absorb, and a reset gate that controls how much of the old state feeds into the fresh candidate, all the way down to ignoring it entirely. The cell state is folded back into the hidden state. Because a GRU layer has three weight blocks (update, reset, candidate) where an LSTM layer has four (input, forget, output, candidate), it carries roughly 25% fewer parameters at the same hidden size. Training runs about 10-15% faster, and accuracy on most time-series tasks is usually statistically indistinguishable from LSTM.
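To make the two gates and the parameter saving concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration under stated assumptions, not a reference implementation: the helper names (`param_count`, `affine`, `gru_step`, `init_params`) are hypothetical, each weight block is assumed to have a single bias vector, and the update uses the convention in which the update gate z weights the old state, h = z * h_prev + (1 - z) * candidate. Some libraries use two bias vectors per block or define z the other way around, which changes the absolute numbers but not the 3-to-4 ratio.

```python
import math
import random


def param_count(n_blocks, input_size, hidden_size):
    # Each block: input weights, recurrent weights, and one bias vector
    # (single-bias layout is an assumption of this sketch).
    return n_blocks * (hidden_size * input_size
                       + hidden_size * hidden_size
                       + hidden_size)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, U, b, x, h):
    # Computes W @ x + U @ h + b row by row using plain lists.
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b_i
        for w_row, u_row, b_i in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    # Update gate: how much of the old state to keep.
    z = [sigmoid(v) for v in affine(p["Wz"], p["Uz"], p["bz"], x, h_prev)]
    # Reset gate: how much of the old state the candidate may see.
    r = [sigmoid(v) for v in affine(p["Wr"], p["Ur"], p["br"], x, h_prev)]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate state built from the input and the reset-gated old state.
    cand = [math.tanh(v) for v in affine(p["Wh"], p["Uh"], p["bh"], x, gated)]
    # Blend old state and candidate; there is no separate cell state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]


def init_params(input_size, hidden_size, seed=0):
    # Small random weights and zero biases, purely for illustration.
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    p = {}
    for g in "zrh":  # update, reset, candidate blocks
        p["W" + g] = mat(hidden_size, input_size)
        p["U" + g] = mat(hidden_size, hidden_size)
        p["b" + g] = [0.0] * hidden_size
    return p


# Parameter comparison at the same sizes: four blocks (LSTM) vs three (GRU).
lstm_params = param_count(4, input_size=8, hidden_size=32)
gru_params = param_count(3, input_size=8, hidden_size=32)
print(lstm_params, gru_params, 1 - gru_params / lstm_params)

# One step on a toy input, starting from a zero hidden state.
params = init_params(input_size=2, hidden_size=3)
h = gru_step([0.5, -1.0], [0.0, 0.0, 0.0], params)
print(h)
```

With these sizes the count comes to 5248 parameters for the four-block layout and 3936 for the three-block one, a 25% reduction. The gate behaviour is also easy to probe: pushing the update-gate bias `bz` to a large positive value drives z toward 1, so `gru_step` returns the previous state almost unchanged.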

GRU on Chen Kai Blog
