1. Background
2. Problem statement
3. Model architecture
4. Training data
5. Results
6. Conclusions
7. Impact statement
8. Future directions
9. Contributions
10. Acknowledgements and References

Appendix

6 Conclusions

Toto, through a novel architecture and pre-training corpus, demonstrates state-of-the-art performance both on public benchmarks and on the Datadog observability benchmark. We look forward to sharing many more technical details, experiments, and benchmark results in a forthcoming paper.

7 Impact statement

In developing Toto, Datadog follows a structured approach to responsible development, focusing on identifying, assessing, and mitigating potential risks associated with the use of our model. Because Toto is not intended for mass distribution and specifically generates time series forecasts for observability data, the potential harms are considerably lower than those of more general-purpose models. At Datadog, our primary focus is on ensuring the accuracy, reliability, and security of the forecasts generated by Toto, which are crucial for maintaining and optimizing infrastructure and application performance.

We carefully analyze the potential for Toto to produce incorrect or misleading forecasts that could impact decision-making processes in critical systems. Additionally, we consider the implications of Toto's performance across diverse datasets, ensuring it can generalize well without introducing significant errors.

8 Future directions

Many exciting directions remain for future work. If you are interested in working with us, please reach out to the authors by email.

Some future research questions that particularly intrigue us include:

• Multi-modal inputs: Incorporate additional input modalities such as query metadata and captions to enhance forecast performance.

• Autonomous troubleshooting agents: Augment Datadog's AI agents [50] for troubleshooting and incident response by integrating modality-specific foundation models like Toto to improve their reasoning and planning capabilities with telemetry data.

• Conversational interfaces: Align time series forecasting models with LLMs to develop conversational agents capable of interpreting and reasoning about time series data.

• Model enhancements and scaling: Explore ways to improve and scale model performance through optimizations such as new types of input embeddings, attention mechanisms, and examining alternative variate groupings to capture richer interactions.

9 Contributions

Contributors are listed in alphabetical order.

Othmane Abou-Amal, Joseph Banks, Mayeul Blanzat, Ben Cohen, Youssef Doubli, Ben Hinthorne, Emaad Khwaja, Jared Ledvina, Charles Masson, Sajid Mehmood, Elise Ramé, Maxime Visonneau, Kan Wang.

10 Acknowledgements

Our work is made possible by the efforts of numerous teams at Datadog. Special thanks and acknowledgement to:

Johan Andersen, Roashan Ayene, Romoli Bakshi, Kevin Beach, Bill Birkholz, Rob Boll, Maxim Brown, Benedetto Buratti, Marion Chan-Renous, Jessica Cordonnier, Ben Donohue, Zakaria Fikrat, Quentin François, Erica Hale, Michael Hoang, Joe Jones, Max Livingston, Jesse Mack, Amine Naouas, Sean O'Connor, Brendan Rhoads, Phil Sarin, Vyom Shah, Aaron Taa, Bharath Vontimitta, Dominique West, Steven Zhou.

References

[1] Datadog. Observability platform, 2024. URL https://www.datadoghq.com/monitoring/observability-platform/.

[2] Datadog. Modern infrastructure monitoring, 2024. URL https://www.datadoghq.com/product/infrastructure-monitoring/.

[3] Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 3rd edition, 2021. URL https://otexts.com/fpp3/.

[4] Robert Fildes, Michèle Hibon, Spyros Makridakis, and Nigel Meade. Generalising about univariate forecasting methods: further empirical evidence. International Journal of Forecasting, 14:339–358, September 1998. ISSN 0169-2070. doi: 10.1016/S0169-2070(98)00009-0.

[5] Simon Stevenson. A comparison of the forecasting ability of ARIMA models. Journal of Property Investment & Finance, 25:223–240, May 2007. ISSN 1463-578X. doi: 10.1108/14635780710746902.

[6] Charisios Christodoulos, Christos Michalakelis, and Dimitris Varoutas. Forecasting with limited data: Combining ARIMA and diffusion models. Technological Forecasting and Social Change, 77:558–565, May 2010. ISSN 0040-1625. doi: 10.1016/j.techfore.2010.01.009.

[7] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36:1181–1191, 2020. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2019.07.001. URL https://www.sciencedirect.com/science/article/pii/S0169207019301888.

[8] Eoin Brophy, Zhengwei Wang, Qi She, and Tomás Ward. Generative adversarial networks in time series: A systematic literature review. ACM Computing Surveys, 55:1–31, October 2023. ISSN 0360-0300. doi: 10.1145/3559540.

[9] Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. Exploring the hidden dimension in accelerating convolutional neural networks, 2018. URL https://openreview.net/forum?id=SJCPLLpaW.

[10] Weizheng Xu, Youtao Zhang, and Xulong Tang. Parallelizing DNN training on GPUs: Challenges and opportunities. Pages 174–178, ACM, April 2021. ISBN 9781450383134. doi: 10.1145/3442442.3452055.

[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

[12] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. 2021. URL https://openreview.net/forum?id=J4gRj6d5Qm.

[13] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. 2020. URL https://api.semanticscholar.org/CorpusID:229156802.

[14] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. 2023. URL https://openreview.net/forum?id=Jbdc0vTOcol.

[15] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. 2024. URL https://openreview.net/forum?id=Yd8eHMY1wz.

[16] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=vSVLM2j9eie.

[17] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. 2024. URL https://openreview.net/forum?id=JePfAI8fah.

[18] Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, and Ievgen Redko. SAMformer: Unlocking the potential of transformers in time series forecasting with sharpness-aware minimization and channel-wise attention. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=8kLzL5QBh2.

[19] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=jn2iTJas6h.

[20] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. CoRR, abs/2106.04554, 2021. URL https://arxiv.org/abs/2106.04554.

[21] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. Chronos: Learning the language of time series, 2024. URL https://arxiv.org/abs/2403.07815.

[22] Azul Garza and Max Mergenthaler-Canseco. TimeGPT-1, 2023.

[23] Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, and Irina Rish. Lag-Llama: Towards foundation models for time series forecasting. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023. URL https://openreview.net/forum?id=jYluzCLFDM.

[24] Nate Gruver, Marc Anton Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=md68e8iZK1.

[25] Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018. URL https://api.semanticscholar.org/CorpusID:49313245.

[26] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533.

[27] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture, 2020. URL https://openreview.net/forum?id=B1x8anVFPr.

[28] Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Advances in Neural Information Processing Systems 32, Vancouver, Canada, 2019. URL https://openreview.net/references/pdf?id=S1qBAf6rr.

[29] Noam Shazeer. GLU variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.

[30] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=HJlnC1rKPB.

[31] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.

[32] Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. MSA transformer. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8844–8856. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/rao21a.html.

[33] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021. doi: 10.1109/ICCV48922.2021.00676.

[34] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding, 2021.

[35] Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. In ACL 2023, December 2022. URL https://www.microsoft.com/en-us/research/publication/a-length-extrapolatable-transformer/.

[36] Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. Long-term forecasting with TiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=pCbC3aQB5W.

[37] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[38] D. Peel and G.J. McLachlan. Robust mixture modelling using the t distribution. Statistics and Computing, 10(4):339–348, 2000.

[39] Mika Meitz, Daniel P. A. Preve, and Pentti Saikkonen. A mixture autoregressive model based on Student's t-distribution. Communications in Statistics - Theory and Methods, 52:499–515, 2018. URL https://api.semanticscholar.org/CorpusID:73615847.

[40] C. S. Wong, W. S. Chan, and P. L. Kam. A Student t-mixture autoregressive model with applications to heavy-tailed financial data. Biometrika, 96(3):751–760, 2009. ISSN 0006-3444, 1464-3510. URL http://www.jstor.org/stable/27798861.

[41] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=cGDAkQo1C0p.

[42] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

[43] Datadog. Querying, 2024. URL https://docs.datadoghq.com/dashboards/querying/.

[44] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. In International Conference on Learning Representations, 2023.

[45] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11121–11128, June 2023. doi: 10.1609/aaai.v37i9.26317. URL https://ojs.aaai.org/index.php/AAAI/article/view/26317.

[46] Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. SCINet: Time series modeling and forecasting with sample convolution and interaction. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=AyajSjTAzmg.

[47] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022), 2022.

[48] J. Scott Armstrong. Long-range Forecasting: From Crystal Ball to Computer. John Wiley & Sons, New York, 1985. ISBN 9780471822608.

[49] R. J. Hyndman and A. B. Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22, 2006.

[50] Datadog. Bits AI: Reimagining the way you run operations with autonomous investigations, 2024. URL https://www.datadoghq.com/blog/bits-ai-autonomous-investigations.


Authors:

(1) Ben Cohen ([email protected]);

(2) Emaad Khwaja ([email protected]);

(3) Kan Wang ([email protected]);

(4) Charles Masson ([email protected]);

(5) Elise Rame ([email protected]);

(6) Youssef Doubli ([email protected]);

(7) Othmane Abou-Amal ([email protected]).


This paper is available on arXiv under a CC BY 4.0 license.