Abstract and 1. Introduction

  2. Related Work

  3. Proposed Dataset

  4. SymTax Model

    4.1 Prefetcher

    4.2 Enricher

    4.3 Reranker

  5. Experiments and Results

  6. Analysis

    6.1 Ablation Study

    6.2 Quantitative Analysis and 6.3 Qualitative Analysis

  7. Conclusion

  8. Limitations

  9. Ethics Statement and References

Appendix

7 Conclusion

In this paper, we present a model for local citation recommendation that leverages the notion of symbiosis from biology and draws an analogy between it and human citation behaviour. We propose taxonomy fusion for learning rich concept representations and project them into hyperbolic space to derive a latent feature. We introduce a novel dataset that is comparatively large, dense and recent, and more challenging than existing datasets. Through several experiments and analyses, we show that our model is highly modular: it can run on datasets with comparatively few signals and can accommodate additional signals as well. Our model consistently outperforms state-of-the-art methods by large margins on all evaluation metrics across all datasets.

8 Limitations

The current work marks an initial step towards incorporating human behaviour into the design of a citation recommendation system. We show empirically that doing so leads to significant gains in performance. However, additional signals that resemble actual citation behaviour could be incorporated to improve performance further. In its current form, our system works only in offline mode. We intend to extend it to the online setting so that it can provide real-time recommendations.

9 Ethics Statement

Our work focuses on advancing citation recommendation and assisting researchers in their academic writing process, and we are committed to maintaining ethical standards throughout. We will release our curated dataset, which can serve as a large and suitable benchmark for future research. In the interest of transparency, our methodologies adhere to ethical guidelines and reflect responsible research considerations. We assert that our work contributes positively to the citation ecosystem without raising ethical or moral concerns. We remain vigilant in addressing any unforeseen ethical challenges, driven by a commitment to principled research conduct. Our goal is to foster collaboration, uphold privacy, and enhance scholarly discourse.

References

Zafar Ali, Guilin Qi, Khan Muhammad, Pavlos Kefalas, and Shah Khusro. 2021. Global citation recommendation employing generative adversarial network. Expert Syst. Appl., 180(C).

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-based citation recommendation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 238–251, New Orleans, Louisiana. Association for Computational Linguistics.

Lutz Bornmann, Robin Haunschild, and Rüdiger Mutz. 2021. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications, 8(1):1–15.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.

Tao Dai, Li Zhu, Yaxiong Wang, and Kathleen M Carley. 2019. Attentive stacked denoising autoencoder with bi-lstm for personalized context-aware citation recommendation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:553–568.

Yao Deng, Xi Zheng, Tianyi Zhang, Chen Chen, Guannan Lou, and Miryung Kim. 2020. An analysis of adversarial attacks and defenses on autonomous driving models. In 2020 IEEE international conference on pervasive computing and communications (PerCom), pages 1–10. IEEE.

Travis Ebesu and Yi Fang. 2017. Neural citation network for context-aware citation recommendation. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, pages 1093–1096.

Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic neural networks. Advances in neural information processing systems, 31.

Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. 2019. Combining neural networks with personalized pagerank for classification on graphs. In International Conference on Learning Representations.

Nianlong Gu, Yingqiang Gao, and Richard HR Hahnloser. 2022. Local citation recommendation with hierarchical-attention text encoder and SciBERT-based reranking. In European Conference on Information Retrieval, pages 274–288. Springer.

Lantian Guo, Xiaoyan Cai, Fei Hao, Dejun Mu, Changjian Fang, and Libin Yang. 2017. Exploiting fine-grained co-authorship for personalized citation recommendation. IEEE Access, 5:12714–12725.

Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, and Lee Giles. 2010. Context-aware citation recommendation. In Proceedings of the 19th international conference on World wide web, pages 421–430.

Wenyi Huang, Saurabh Kataria, Cornelia Caragea, Prasenjit Mitra, C Lee Giles, and Lior Rokach. 2012. Recommending citations: translating papers into references. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1910–1914.

Wenyi Huang, Zhaohui Wu, Chen Liang, Prasenjit Mitra, and C Giles. 2015. A neural probabilistic model for context based citation recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.

Chanwoo Jeong, Sion Jang, Eunjeong Park, and Sungchul Choi. 2020. A context-aware citation recommendation model with BERT and graph convolutional networks. Scientometrics, 124:1907–1922.

Rob Johnson, Anthony Watkinson, and Michael Mabe. 2018. The stm report. An overview of scientific and scholarly publishing. 5th edition October, page 94.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, page 2.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A robustly optimized BERT pretraining approach.

Avishay Livne, Vivek Gokuladas, Jaime Teevan, Susan T Dumais, and Eytan Adar. 2014. Citesight: supporting contextual citation recommendation using differential search. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 807–816.

Zoran Medic and Jan Šnajder. 2020. Improved local citation recommendation based on context enhanced with global information. In Proceedings of the first workshop on scholarly document processing, pages 97–103.

Laurent Meunier, Raphael Ettedgui, Rafael Pinot, Yann Chevaleyre, and Jamal Atif. 2022. Towards consistency in adversarial classification. In Advances in Neural Information Processing Systems, volume 35, pages 8538–8549. Curran Associates, Inc.

Gabriela F Nane, Nicolas Robinson-Garcia, François van Schalkwyk, and Daniel Torres-Salinas. 2023. Covid-19 and the scientific publishing system: growth, open access and scientific fields. Scientometrics, 128(1):345–362.

Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, and Georg Rehm. 2022. Neighborhood contrastive learning for scientific document representations with citation embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11670–11688, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.

Ramit Sawhney, Ritesh Soun, Shrey Pandit, Megh Thakkar, Sarvagya Malaviya, and Yuval Pinter. 2022. Ciaug: Equipping interpolative augmentation with curriculum learning. In NAACL, pages 1758–1764.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.

Yifan Wang, Yiping Song, Shuai Li, Chaoran Cheng, Wei Ju, Ming Zhang, and Sheng Wang. 2022. Disencite: Graph-based disentangled representation learning for context-specific citation generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11449–11458.

Qianqian Xie, Yutao Zhu, Jimin Huang, Pan Du, and Jian-Yun Nie. 2021. Graph neural collaborative topic model for citation recommendation. ACM Transactions on Information Systems (TOIS), 40(3):1–30.

A. Appendix

We conduct another quantitative analysis using the section heading as an additional signal in our reranking module.

A.1 Additional Experiment

We concatenate the section heading with the query context in the reranker and run our two SymTax variants. From Table 6, we observe that using the section heading leads to a significant performance drop for SciBERT_vector on all metrics. For SPECTER_graph, however, the overall performance remains nearly the same. Both patterns indicate that the section heading acts as noise when used as a feature, which suggests that the citation contexts are already sufficiently rich. Since our proposed dataset contains this additional feature, it is also suitable for two additional tasks: context-specific citation generation (Wang et al., 2022), and section heading prediction for a given citation context.
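The concatenation step described above can be illustrated with a minimal sketch. It is not the SymTax implementation: the function name `build_reranker_query`, the `[SEP]` separator string, and the choice to place the heading before the context are all assumptions made for illustration, since this appendix does not specify them.

```python
from typing import Optional


def build_reranker_query(context: str,
                         section_heading: Optional[str] = None,
                         sep: str = " [SEP] ") -> str:
    """Build the query text passed to a reranker.

    Hypothetical helper: the separator and the heading-first order
    are assumptions, not details taken from the paper.
    """
    # Collapse runs of whitespace so both fields are single-line strings.
    context = " ".join(context.split())
    heading = " ".join((section_heading or "").split())
    if not heading:
        # Baseline setting: the query is the citation context alone.
        return context
    # Ablation setting: section heading concatenated with the context.
    return heading + sep + context


if __name__ == "__main__":
    print(build_reranker_query("prior work on  citation recommendation",
                               "Related Work"))
```

Leaving `section_heading` empty reproduces the context-only input, so the same function can feed both settings compared in this experiment.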

A.2 Implementation Details

A.3 Datasets

ACL-200. This dataset contains papers published at ACL venues. It is a processed version of the ACL-ARC dataset created using ParsCit[12], a string parsing package based on conditional random fields. Each citation context consists of the ±200 characters around the citation placeholder.

FullTextPeerRead. This dataset is an expansion of the PeerRead dataset, which contains the peer reviews of papers submitted to top venues in the Artificial Intelligence domain. FullTextPeerRead thus contains the citation contexts from the papers present in PeerRead.

RefSeer. This dataset is curated by extracting scientific articles belonging to various engineering domains. A citation excerpt is taken as the text of ±200 characters around the citation marker. It is a large dataset that contains 3.7 million citation contexts.

arXiv (HAtten). This dataset is created from arXiv papers in S2ORC[13], a large and diverse corpus of scientific articles. For every paper whose full text is available, a citation excerpt is kept if the cited paper is also present in the arXiv database. Following the convention set by ACL-200 and RefSeer, each excerpt consists of the words in the ±200 character window around the citation marker.
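The ±200 character window shared by ACL-200, RefSeer and arXiv (HAtten) can be sketched as follows. This is only an illustration of the windowing idea, not the preprocessing code used to build any of these datasets: the marker string `<CIT>`, the function name `extract_contexts`, and the decision to keep the marker inside the excerpt are assumptions.

```python
from typing import List


def extract_contexts(text: str,
                     marker: str = "<CIT>",
                     window: int = 200) -> List[str]:
    """Return one excerpt per citation marker found in `text`.

    Each excerpt holds up to `window` characters before the marker,
    the marker itself, and up to `window` characters after it.
    The marker string is a hypothetical placeholder.
    """
    contexts = []
    start = text.find(marker)
    while start != -1:
        end = start + len(marker)
        # Clamp the left edge at the start of the document.
        left = text[max(0, start - window):start]
        # Slicing past the end of the string is safe in Python.
        right = text[end:end + window]
        contexts.append(left + marker + right)
        start = text.find(marker, end)
    return contexts


if __name__ == "__main__":
    sample = "We build on prior work <CIT> and extend it to new domains."
    for excerpt in extract_contexts(sample, window=20):
        print(repr(excerpt))
```

Because the window is measured in characters, an excerpt may begin or end mid-word; a real pipeline might trim to word boundaries afterwards.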

Authors:

(1) Karan Goyal, IIIT Delhi, India;

(2) Mayank Goel, NSUT Delhi, India;

(3) Vikram Goyal, IIIT Delhi, India;

(4) Mukesh Mohania, IIIT Delhi, India.


This paper is available on arXiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.

[12] https://github.com/knmnyn/ParsCit

[13] https://github.com/allenai/s2orc