Authors:
(1) Guang-Yih Sheu, Department of Innovative Application and Management/Accounting and Information System, Chang-Jung Christian University, Tainan, Taiwan; this author contributed equally to this work ([email protected]);
(2) Nai-Ru Liu, Department of Accounting and Information System, Chang-Jung Christian University, Tainan, Taiwan ([email protected]).
Editor's note: this is part 2 of 3 of a study exploring how AI-powered sampling can help auditors handle large datasets. Read the rest below.
Table of Links
- Abstract and 1. Introduction
- 2. Literature review
- 3. Naive Bayes classifier
- 4. Results
- Discussion
- Conclusions and References
2. Literature review
As stated earlier, only a few auditing studies have sampled data using a machine learning algorithm. This scarcity makes it difficult to find guidance for implementing this study.
To improve the efficiency of auditing, some published studies (e.g., [5]) have integrated machine learning with sampling to detect anomalies. For example, Chen et al. [5] selected the ID3, CART, and C4.5 algorithms to find anomalies in financial transactions. Their results indicated that a machine learning algorithm can simplify the audit of financial transactions by efficiently exploring their attributes.
Schreyer et al. [8,9] constructed an autoencoder neural network to sample journal entries in their two papers. They fed attributes of those journal entries into the resulting autoencoder. However, Schreyer et al. described the representativeness of their samples by plotting figures.
Lee [10] built another autoencoder neural network to sample taxpayers. Unlike Schreyer et al. [8,9], Lee calculated the reconstruction error to quantify the representativeness of samples. This metric measures the difference between the input data and the outputs reconstructed from the samples. Lower reconstruction errors indicate that the samples better represent the original taxpayers. In addition, Lee [10] used the Apriori algorithm to find taxpayers who may be worth sampling together, on the reasoning that if one taxpayer breaks the law, associated taxpayers may also be fraudulent.
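The reconstruction error described above can be illustrated with a minimal sketch. It assumes the error is a mean squared difference between each input value and its reconstruction; the exact formula used by Lee [10] is not given here, and the function name and sample data are hypothetical.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared difference between input rows and their reconstructions.

    Assumption: mean squared error is used as the distance; other norms
    would serve the same illustrative purpose.
    """
    if len(original) != len(reconstructed):
        raise ValueError("original and reconstructed must have the same number of rows")
    total, count = 0.0, 0
    for row, rec in zip(original, reconstructed):
        if len(row) != len(rec):
            raise ValueError("each row and its reconstruction must have the same length")
        for a, b in zip(row, rec):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("no values to compare")
    return total / count


# Hypothetical data: two records with two attributes each.
original = [[1.0, 2.0], [3.0, 4.0]]
close_reconstruction = [[1.1, 2.0], [3.0, 3.9]]
far_reconstruction = [[2.0, 0.0], [5.0, 1.0]]

close_error = reconstruction_error(original, close_reconstruction)
far_error = reconstruction_error(original, far_reconstruction)
print(close_error < far_error)  # the closer reconstruction has the lower error
```

Under this assumption, a sample set whose reconstruction yields the smaller value would be judged the more representative one.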
Chen et al. [11] applied the random forest classifier, XGBoost algorithm, quadratic discriminant analysis, and support vector machine model to sample attributes of Bitcoin daily transaction data. These attributes include property and network, trading and market, attention, and gold spot prices. The goal of that research was to predict Bitcoin daily prices. Chen et al. [11] found that machine learning algorithms predicted Bitcoin 5-minute interval prices more accurately than statistical methods did.
Unlike the four studies above, Zhang and Trubey [3] designed under-sampling and over-sampling methods to highlight rare events in a money laundering problem. Their goal was to improve the performance of machine learning algorithms in modeling money laundering events. Zhang and Trubey [3] adopted the Bayes logistic regression, decision tree, random forest classifier, support vector machine model, and artificial neural network.
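To make the idea of under-sampling and over-sampling concrete, the following is a minimal sketch of the simplest random variants for a binary problem. It is not the specific scheme designed by Zhang and Trubey [3]; the function names, the seed parameter, and the toy data are assumptions for illustration.

```python
import random


def random_oversample(records, labels, minority, seed=0):
    """Duplicate randomly chosen minority records until both classes are equal in size.

    Assumption: a binary problem in which every label other than `minority`
    belongs to the majority class.
    """
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(records, labels) if y == minority]
    majority_count = len(labels) - len(minority_rows)
    extra = max(0, majority_count - len(minority_rows))
    if extra and not minority_rows:
        raise ValueError("no minority records to duplicate")
    duplicates = [rng.choice(minority_rows) for _ in range(extra)]
    return list(records) + duplicates, list(labels) + [minority] * extra


def random_undersample(records, labels, minority, seed=0):
    """Keep all minority records and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    minority_pairs = [(r, y) for r, y in zip(records, labels) if y == minority]
    majority_pairs = [(r, y) for r, y in zip(records, labels) if y != minority]
    kept = rng.sample(majority_pairs, min(len(majority_pairs), len(minority_pairs)))
    pairs = minority_pairs + kept
    return [r for r, _ in pairs], [y for _, y in pairs]


# Hypothetical data: 2 rare events (label 1) among 10 records.
records = list(range(10))
labels = [1, 1] + [0] * 8

over_records, over_labels = random_oversample(records, labels, minority=1)
under_records, under_labels = random_undersample(records, labels, minority=1)
print(over_labels.count(1), over_labels.count(0))    # 8 8
print(under_labels.count(1), under_labels.count(0))  # 2 2
```

Either operation balances the class counts, which is what lets a classifier give rare events more weight during training.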
In fields other than auditing, three examples are listed: Liberty et al. [12] defined a specialized regression problem to calculate the probability of sampling each record of a browse dataset. The goal was to sample a small set of records over which aggregate queries can be evaluated both efficiently and accurately. They solved the regression problem with a simple regularized empirical risk minimization algorithm. Liberty et al. [12] concluded that machine learning integration improved both uniform and standard stratified sampling methods.
Hollingsworth et al. [13] derived generative machine learning models to improve the computational efficiency of sampling high-dimensional parameter spaces. Their results achieved orders-of-magnitude improvements in sampling efficiency compared to a brute-force search.
Artrith et al. [14] combined a genetic algorithm and a specialized machine-learning potential based on artificial neural networks to accelerate the sampling of amorphous and disordered materials. They found that machine learning integration decreased the required calculations in sampling.
Other relevant studies discussed the benefits or challenges of integrating a machine learning algorithm with the audit of data. These studies serve mainly to remind the current study of these benefits and challenges. For example, Huang et al. [15] suggested that a machine learning algorithm may serve as a "black box" to help an auditor. However, auditors may need help in mastering a machine learning algorithm. Furthermore, auditors may misunderstand the performance of a machine learning algorithm and believe that it always classifies or clusters data accurately. On the benefit side, a machine learning algorithm can improve effectiveness and cost efficiency, analyze massive data sets, and reduce the time spent on tasks. Therefore, we should ensure the performance of a machine learning algorithm is sufficiently good before applying it to aid auditors' work.
3. Naive Bayes classifier
With reference to conventional sampling methods [7], this study designs user-based and item-based approaches that integrate Equations (3)-(4) with the selection of audit evidence:
i. User-based approach: To generate unbiased representations of data, classify (X1, C1), (X2, C2), ..., (XN, CN) and compute two percentiles symmetric around the median of each class according to the auditor's professional preferences. Draw the X1, X2, ..., XN bounded by the resulting percentiles as audit evidence; and
ii. Item-based approach: Suppose the (Xj, Cj) (1 ≤ j ≤ N) represent risky samples. After classifying (X1, C1), (X2, C2), ..., (XN, CN), asymmetrically sample them as audit evidence based on the Pr(Ci | Xj) (1 ≤ i ≤ N) values.
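Both approaches start from a classification step that yields posterior probabilities Pr(Ci | Xj). The sketch below shows one way such posteriors could be computed with a Naive Bayes classifier. It assumes numeric attributes with Gaussian class-conditional distributions, which may differ from the form of Equations (3)-(4) in the original paper; the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior, per-feature means, and per-feature variances for each class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance positive (a stability assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def posterior(model, x):
    """Return {class: Pr(class | x)} via Bayes' rule with independent Gaussian features."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        value = math.log(prior)
        for v, m, var in zip(x, means, variances):
            value += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_joint[label] = value
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_joint.values())
    weights = {c: math.exp(v - top) for c, v in log_joint.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical data: one numeric attribute, two classes.
X = [[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]]
y = ["low", "low", "low", "high", "high", "high"]
model = fit_gaussian_nb(X, y)
probs = posterior(model, [1.1])
print(max(probs, key=probs.get))  # low
```

The resulting posterior values are the quantities on which the percentile bounds and the asymmetric selection described in the list above would operate.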
3.1. User-based approach
Regarding existing audit sampling methods [4], the present user-based approach may be identical to a combination of the monetary and variable sampling methods.
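A minimal sketch of the user-based selection follows. It assumes each record carries one numeric score (for example, the posterior probability of its assigned class) and that the auditor's preference is expressed as a half-width around the median percentile; both choices, and all names used, are illustrative assumptions rather than the paper's exact procedure.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must lie in [0, 100]")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def user_based_sample(scores, classes, half_width=25.0):
    """Indices of records lying between two percentiles symmetric around each class median.

    half_width=25 keeps records between the 25th and 75th percentiles of
    their own class (a hypothetical auditor preference).
    """
    by_class = {}
    for idx, (score, cls) in enumerate(zip(scores, classes)):
        by_class.setdefault(cls, []).append((idx, score))
    selected = []
    for members in by_class.values():
        values = [score for _, score in members]
        low = percentile(values, 50 - half_width)
        high = percentile(values, 50 + half_width)
        selected.extend(idx for idx, score in members if low <= score <= high)
    return sorted(selected)


# Hypothetical scores for ten records in two classes.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0]
classes = ["A"] * 5 + ["B"] * 5
evidence = user_based_sample(scores, classes)
print(evidence)  # [1, 2, 3, 6, 7, 8]
```

Widening `half_width` draws more records from each class, while narrowing it concentrates the evidence around the class medians.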
3.2. Item-based approach
Further simplifying Equation (10) results in
Regarding existing audit sampling methods [4], the present item-based approach may be equivalent to a combination of non-statistical and monetary sampling methods.
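The asymmetric selection of the item-based approach can be sketched as taking only the upper tail of the posterior probabilities for the risky class. The tail fraction, the function name, and the sample posteriors are assumptions for illustration; the paper's own selection rule follows from equations not reproduced in this section.

```python
import math


def item_based_sample(risk_posteriors, tail=0.2):
    """Indices of the records with the highest posterior probability of the risky class.

    The selection is one-sided: only the upper `tail` fraction is drawn,
    in contrast to percentile bounds placed symmetrically around a median.
    """
    if not 0 < tail <= 1:
        raise ValueError("tail must lie in (0, 1]")
    k = math.ceil(len(risk_posteriors) * tail)
    ranked = sorted(range(len(risk_posteriors)),
                    key=lambda i: risk_posteriors[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical posterior probabilities of the risky class for ten records.
risk_posteriors = [0.05, 0.9, 0.3, 0.75, 0.1, 0.6, 0.2, 0.95, 0.4, 0.5]
risky_evidence = item_based_sample(risk_posteriors, tail=0.2)
print(risky_evidence)  # [1, 7]
```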
As in Section 3.1, we calculate the representativeness index RI [3] to check whether the audit evidence is sufficiently representative.
3.3. Hybrid approach
Auditors may combine the approaches of Sections 3.1-3.2 to balance representativeness and riskiness. We first apply the user-based approach to sample representative members bounded by two percentiles symmetric around the median of a Ci (1 ≤ i ≤ N) class. We then apply the item-based approach to asymmetrically sample the riskier members among those representative samples.
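The two-stage procedure can be sketched as follows. The sketch assumes a per-record score for the symmetric stage and a separate per-record risk value (such as the posterior probability of a risky class) for the asymmetric stage; these inputs, the parameter defaults, and the function names are hypothetical.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hybrid_sample(scores, classes, risk, half_width=25.0, tail=0.5):
    """Stage 1 keeps representative records; stage 2 keeps the riskiest of them."""
    # Stage 1 (user-based): percentile bounds symmetric around each class median.
    by_class = {}
    for idx, (score, cls) in enumerate(zip(scores, classes)):
        by_class.setdefault(cls, []).append((idx, score))
    representative = []
    for members in by_class.values():
        values = [score for _, score in members]
        low = percentile(values, 50 - half_width)
        high = percentile(values, 50 + half_width)
        representative.extend(idx for idx, score in members if low <= score <= high)
    # Stage 2 (item-based): upper tail of the risk values among representatives.
    k = math.ceil(len(representative) * tail)
    ranked = sorted(representative, key=lambda i: risk[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical scores, classes, and risk values for ten records.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0]
classes = ["A"] * 5 + ["B"] * 5
risk = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.7, 0.1, 0.6, 0.5]
hybrid_evidence = hybrid_sample(scores, classes, risk)
print(hybrid_evidence)  # [1, 3, 6]
```

Here `half_width` controls how representative the first stage is, and `tail` controls how strongly the second stage favors risky records, so the pair expresses the balance between representativeness and riskiness.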
This paper is available on arXiv under the Attribution-NonCommercial-ShareAlike 4.0 International license.