
Authors:

(1) Harrison Mateika, Northwestern University;

(2) Juannan Jia, Northwestern University;

(3) Linda Lillard, Northwestern University;

(4) Noah Cronbaugh, Northwestern University;

(5) Will Shin, Northwestern University.

Table of Links

3. Data Collection

The data were collected from a Kaggle data set. It contains 10 years of Taiwanese company data, with financial information for 6,819 companies from 1999 to 2009, published by the Taiwan Economic Journal. The data set includes 95 financial ratios, ratios regarding corporate governance, and a bankruptcy indicator (1 = bankrupt, 0 = non-bankrupt). Corporate bankruptcy was defined by the business rules of the Taiwan Stock Exchange (Wang and Liu 2021). We found no null values, and every column was stored either as an integer (int64) or as a float (float64).

During the initial exploratory data analysis, we found that, like most bankruptcy data, the Taiwanese company bankruptcy data set is highly skewed toward financially stable (non-bankrupt) companies. Training our models on the data set as is would likely produce overfitting and results biased toward non-bankrupt companies.

As seen on the graph, there is a large imbalance between bankrupt and non-bankrupt companies in this data set. This poses a problem because most machine learning algorithms do not work well with imbalanced data sets: unless proper sampling techniques are applied, the model will not learn enough about the bankrupt companies.
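The effect of the imbalance can be illustrated with a small sketch. The labels below are hypothetical toy values, not the Taiwanese data, and the helper names `class_balance` and `majority_baseline_accuracy` are introduced here only for illustration. The point is that a model that always predicts "non-bankrupt" can score a high accuracy while identifying no bankrupt company at all.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: (count, share)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical toy labels (1 = bankrupt, 0 = non-bankrupt), NOT the real data.
toy_labels = [0] * 97 + [1] * 3

balance = class_balance(toy_labels)
baseline = majority_baseline_accuracy(toy_labels)

print(balance)
print(baseline)
```

On such data, accuracy alone says little about how well the minority class is detected, which is one reason resampling is worth considering.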

To address the imbalance, we created a separate, more balanced data set across bankrupt and non-bankrupt companies. We used SMOTE (Synthetic Minority Oversampling Technique) to create the oversampled data set. SMOTE works by selecting minority-class examples that are close together in the feature space, drawing a line between those examples, and generating a new sample at a point along that line (Brownlee 2020). We used the oversample.fit_resample() method to create a new oversampled data file containing 26,396 perfectly balanced records: 13,198 bankrupt records and 13,198 non-bankrupt records.
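The interpolation step described above can be sketched in plain Python. This is a simplified illustration of the SMOTE idea, not the library implementation behind `oversample.fit_resample()`. The function name `smote_sketch`, its parameters, and the toy points are all assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Random position on the line segment from base to neighbour.
        t = rng.random()
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D toy data, NOT the real financial ratios.
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]

# Generate enough synthetic minority points to match the majority class.
synthetic = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + synthetic

print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.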

4. Data Analysis

During the exploratory data analysis (EDA) stage, we used a correlation heatmap matrix to identify the ten features most highly correlated with bankruptcy. Of those ten, the five that most commonly showed a high correlation with bankruptcy were Debt Ratio Percentage, Current Liability to Assets, Current Liability to Current Assets, Total Expense / Assets, and Cash / Current Liability. We then compared the top positively and negatively correlated features and found that organizations with more assets and earnings are healthier and less likely to go bankrupt.
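A ranking of this kind can be sketched as follows. The example assumes the Pearson correlation coefficient, which the text does not state explicitly, and it uses hypothetical toy columns rather than the real ratios. The names `pearson`, `rank_by_correlation`, and the feature keys are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def rank_by_correlation(features, target, top_n=10):
    """Rank features by the absolute value of their correlation with target."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_n]


# Hypothetical toy data (1 = bankrupt, 0 = non-bankrupt), NOT the real data set.
target = [0, 0, 0, 0, 1, 1]
features = {
    "debt_ratio": [0.1, 0.2, 0.15, 0.2, 0.8, 0.9],
    "net_income_to_assets": [0.7, 0.8, 0.75, 0.9, 0.3, 0.2],
    "noise": [0.5, 0.1, 0.9, 0.3, 0.6, 0.4],
}

ranked = rank_by_correlation(features, target, top_n=3)
for name, score in ranked:
    print(f"{name}: {score:+.3f}")
```

Sorting by absolute value keeps strongly negative features near the top as well, which matters when comparing positively and negatively correlated features side by side.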

Other observations during the EDA process included:

❖ Most of the features have outliers. The median will therefore be a better summary statistic than the mean, and removing some outliers will be a good idea when building the model.

❖ Companies with a low ‘Net profit before tax/Paid-in capital’, ‘Persistent EPS’, and ‘Net Value Per Share (A)’ tend to go bankrupt.

❖ Bankrupt companies are distributed throughout the whole range of ‘Borrowing dependency’, whereas the companies that do not go bankrupt are located around 0.4. A value around 0.4 does not guarantee safety from bankruptcy, since many companies went bankrupt with this value, but a higher or lower value seems critical, since there are not any companies operating with such values.

❖ A “Net Income to Stockholder’s Equity” value of 0.8 is an excellent level at which to operate but does not entirely protect a company from bankruptcy.

❖ Few organizations went bankrupt in the 10 years from 1999 to 2009.

❖ Very few organizations with negative income have suffered from bankruptcy in the past two years.

❖ An increase in the values of the attributes that negatively correlate with the target attribute helps an organization avoid bankruptcy.
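The first observation above, about outliers and the median, can be illustrated with a short sketch. It assumes the common 1.5 × IQR rule for flagging outliers, which the text does not specify, and it uses hypothetical toy values. The helper name `iqr_bounds` is introduced here for illustration.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds; values outside them count as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy values for one financial ratio, NOT the real data.
values = [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.95]

lower, upper = iqr_bounds(values)
kept = [v for v in values if lower <= v <= upper]

# The single extreme value pulls the mean upward but barely moves the median.
mean_all = statistics.mean(values)
median_all = statistics.median(values)
mean_kept = statistics.mean(kept)

print(mean_all, median_all, mean_kept)
```

Whether outliers should be dropped, capped, or kept depends on the model. In bankruptcy data an extreme ratio may itself be a signal, so any removal rule is a judgment call rather than a fixed recipe.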

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

Data Collection and Analysis for Machine Learning‑Based Bankruptcy Prediction
