Predicting and explaining corruption in countries of the MENA region: A machine learning approach

Farjallah Nadia[1], Zaouali Naima[2]

Abstract

Corruption is still pervasive and is seen as one of the great challenges of modern societies. Many academic studies have attempted to identify and explain the causes and potential consequences of corruption, primarily through theoretical lenses using correlations and regression-based statistical analyses. The present study approaches the phenomenon from the predictive analytics perspective, employing contemporary machine learning techniques to discover the most important predictors of perceived corruption based on enriched nonlinear models with a high level of predictive accuracy. Specifically, within the regression modeling setting employed herein, Random Forest (an ensemble-type machine learning algorithm) is found to be the most accurate prediction model, followed by Gradient Boosting and CART. In practice, the increased predictive power of machine learning algorithms, coupled with a multi-source database, reveals the most relevant corruption-related information, contributing to the body of knowledge and generating actionable information for administrators, academics, citizens and politicians. The variable importance results indicate that Trading across Borders, Time – Men (days), Getting Electricity and Cost (% of warehouse value) are the most influential factors in determining the level of corruption.

  1. Introduction 

Since the early 1980s, corruption has been at the heart of international policy and development debates. Most economists define it as the abuse of public office for private gain. It can affect all countries, rich or poor, democratic or non-democratic, and developing countries in particular. Corrupt behaviour in politics limits economic growth, embezzles public funds, and promotes socio-economic inequality in modern democracies. Ribeiro et al. (2018) analyse well-documented political corruption scandals in Brazil over the past 27 years, focusing on the dynamical structure of networks where two individuals are connected if they were involved in the same scandal.

Generally, corruption is a serious crime that weakens the state. According to Transparency International, “corruption comes from the behavior of public sector agents, whether politicians or civil servants, who enrich themselves or their relatives in an illicit way, through the abuse of public powers entrusted to them”. Numerous research studies by the World Bank have shown that the causes of corruption can be classified into four categories: economic, social, political and institutional (Banque Mondiale, 2002[3]). Although the causes of corruption have been extensively explored through regression-based statistical analysis, the results remain unclear or at least insufficient to support conclusions.

Many researchers agree that corruption has an enormous deleterious effect on economies. Nuijten and Anders (2017), for example, take a stance that differs in three fundamental ways from current perspectives in academic and public debates about corruption. First, they do not treat the global anti-corruption industry and dominant social-science approaches as frameworks of analysis but rather as subjects of anthropological study. Second, they do not conceive of corruption as an individual act but as a phenomenon that is institutionalized and embedded in the wider matrix of power relations in society; contextualizing individual acts reveals the systemic and structural dimensions of corruption. Third, and most important, they distance themselves from the commonly held view that corruption is simply the law’s negation, a vice afflicting the body politic.

This paper examines the phenomenon from the predictive analytics perspective, employing contemporary machine learning techniques to discover the most important predictors of perceived corruption based on enriched nonlinear models with a high level of predictive accuracy. Specifically, within the regression modeling setting employed herein, Random Forest (an ensemble-type machine learning algorithm) is found to be the most accurate prediction model, followed by Gradient Boosting and CART.

The main objective of this analytical study is to reveal the likely factors, and their relative importance, as predictors of corruption in the MENA region. To obtain reliable results, we use modern and widely used machine learning algorithms. The use of artificial intelligence and machine learning techniques to detect and understand governmental fraud and corruption has been gaining popularity in the literature in recent years (Stockemer, 2018; Sun & Medaglia, 2019; Tang et al., 2019). According to Sun and Medaglia (2019), there is growing interest in studies involving artificial intelligence in the public sector. This study attacks corruption with modern machine learning techniques, potentially reinforcing government strategies toward a cleaner society. The rest of the paper is organized as follows. Section 2 reviews the literature on corruption. Section 3 describes data acquisition and data preprocessing. Section 4 presents the results and discussion. The last section concludes.

  2. Review of the literature on the theory of corruption

In the literature it is well established by now that corruption not only erodes state legitimacy, but also comes at a substantial economic and social cost to societies. Studies show that corruption negatively affects both public revenues (Aghion et al., 2016; Besley & Persson, 2014) and public expenditures (Mauro, 1998), thus limiting the State’s ability to carry out its functions. It is also found to lead to lower quality of public investment (Iliopulos & Arnone, 2007) and less private investment (Al-Sadig, 2010; Godinez & Liu, 2015). What is more, corruption can contribute to poor social and environmental outcomes. It causes inefficiencies in public service provision (Reinikka & Svensson, 2006) and can deprive a country of its human capital by fostering emigration to places that are being perceived as more meritocratic (Cooray & Schneider, 2016). Corruption can undermine the enforcement of environmental regulations leading to increased pollution (Pellegrini, 2011) and overextraction of natural resources (OECD, 2012). Furthermore, data disaggregated by gender has shown that women tend to suffer more from the negative consequences of corruption than men (Transparency International, 2010).

Over the last quarter century, donors have directed large amounts of resources toward anti-corruption efforts around the world. Reflecting on the achievements of these efforts, many corruption scholars find it difficult to trace major positive results from the anticorruption programming that the World Bank and other international development organizations have launched since the mid-1990s (Rothstein, 2018; Hough, 2017; Mungiu-Pippidi, 2015a; Heeks & Mathisen, 2012). Overall, the empirical picture available corroborates this bleak view on the achievements of international anticorruption efforts. The most widely used global indices that include measurements of corruption, such as the Corruption Perception Index (CPI), the Worldwide Governance Indicators (WGI), and the International Country Risk Guide (ICRG), all show that, on a global scale, corruption remains about as prevalent today as it was when the global anticorruption agenda started twenty-five years ago.

Institutional corruption occurs when an institution or its officials receive a benefit that is directly useful to performing an institutional purpose, and systematically provide a service to the benefactor under conditions that tend to undermine procedures that support the primary purposes of the institution (Thompson 2013). Theorists who have taken this turn point out that institutional corruption does not receive the attention it deserves, partly because it is so closely (and often unavoidably) related to conduct that is part of the job of a responsible official, partly because the perpetrators are often seen as (and are) respectable officials just trying to do their job, and partly because the legal system and public opinion are more comfortable with condemning wrongdoing that has a corrupt motive. Miller (2011) contends that institutional corruption necessarily involves corrupt individuals who are conscious of their deleterious behavior. There seems to be some confusion in the extant literature regarding the boundary between the concepts of institutional and individual corruption. According to Thompson (2018), normative theorists of corruption have developed an institutional conception that is distinct from both the individualist approaches focused on quid pro quo exchanges and other institutional approaches found in the literature on developing societies. These theorists emphasize the close connection between patterns of corruption and the legitimate functions of institutions. In addition, institutional corruption does not require that its perpetrators have corrupt motives, and it is not limited to political institutions.

Few papers examine the relation between economic freedom and corruption. The empirical results of these studies are consistent: the more freedom, the lower the level of corruption, implying that economic freedom is a deterrent to corruption (Chafuen and Guzmán, 2000; Paldam, 2002). In the same vein, Graeff and Mehlkop (2003) find a strong relation between economic freedom and corruption that depends on a country’s level of development. They identify a stable pattern of aspects of economic freedom influencing corruption that differs depending on whether countries are rich or poor.

Moreover, a large body of academic studies has attempted to identify and explain the potential causes and consequences of corruption, at varying levels of granularity, mostly through theoretical lenses using correlations and regression-based statistical analyses. Marcio and Dursun (2020) instead approach the phenomenon from the predictive analytics perspective, employing contemporary machine learning techniques to discover the most important predictors of perceived corruption based on enriched nonlinear models with a high level of predictive accuracy.

  • Data acquisition and data preprocessing

The data were acquired from several sources, including the Ease of Doing Business indexes[4], Transparency International[5], the Human Development Reports of the United Nations Development Programme[6] and the World Bank[7], for the years 2019 and 2020 and for 17 countries in the MENA region (Annex 1). Transparency International is a global movement that works in over 100 countries to end the injustice of corruption. It works to expose the systems and networks that allow corruption to flourish, demanding greater transparency and integrity in all areas of public life. The Corruption Perception Index (CPI) ranks 180 countries according to their perceived corruption levels. The index captures the assessments of domain experts on corrupt behavior, yielding a scale from 0 to 100 where economies close to 0 are perceived as highly corrupt while economies close to 100 are perceived as less corrupt. The Human Development Report aims to support country-related analyses by collecting and exploring data from many countries. The Education Index represents an average of the mean years of schooling of adults and the expected years of schooling of children. The index uses a scale related to the corresponding maxima[8].

The Doing Business project contains measures of trade regulations and of the efficiency with which these regulations are actually enforced, and provides data for 190 countries. It has been used by a number of academic studies in several research areas such as politics, economics and law (Roe and Siegel, 2009; Dixit, 2009). The information provided by the Ease of Doing Business project mainly relates to starting a business, paying taxes, trading across borders, getting credit, and registering property. The Ease of Doing Business ranking system assesses economies by comparing their distance to frontier scores, which benchmark countries against regulatory best practices. Essentially, it is based on systematic comparisons between countries, with a baseline drawn from the country with the best and most efficient practice for a given indicator. For the 2019 Doing Business questionnaire, survey answers were collected from over 13,000 local experts, including public officials, lawyers, and business specialists. A sample binary variable is Quality Control Before Construction, which records whether licensed or technical experts approve building plans. For this type of variable, where the only possible scores are 0 and 1, the output is converted to a string-type variable before computing model results. A sample item treated as a continuous variable is the Time Required to Complete Each Procedure – Registering Property. For this type of variable, the score captures the median time that field professionals take to complete all necessary procedures to register a property. Scores are computed in calendar days. The distance to frontier score is the simple average of the scores on the individual indicators and measures how far an economy is from the most efficient practice. For more details on the Ease of Doing Business methodology, please visit the official website.[9]
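The distance to frontier idea can be sketched in a few lines of Python. The rescaling formula, (worst − value) / (worst − frontier) × 100, is taken from the published Doing Business methodology rather than from this paper, and the function name and all sample values below are hypothetical, chosen only for illustration.

```python
from statistics import mean

def distance_to_frontier(value, frontier, worst):
    """Rescale a raw indicator to 0-100, where 100 is the frontier (best practice)."""
    if frontier == worst:
        raise ValueError("frontier and worst must differ")
    score = (worst - value) / (worst - frontier) * 100.0
    # Clamp values that fall outside the [worst, frontier] range.
    return max(0.0, min(100.0, score))

# Hypothetical economy with two 'Registering Property' indicators
# (illustrative numbers only, not taken from the dataset).
indicator_scores = [
    distance_to_frontier(value=30, frontier=1, worst=210),  # time in days
    distance_to_frontier(value=6, frontier=1, worst=13),    # number of procedures
]

# Topic score: simple average of the indicator scores.
topic_score = mean(indicator_scores)
```

The same function covers indicators where higher is better (for example an index from 0 to 8), because the sign of the numerator and denominator flips together.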

The World Bank is a financial institution that provides loans and grants. With 189 member states, staff from over 170 countries and over 130 branches around the world, the World Bank Group is an unparalleled partnership: five institutions working together to find lasting solutions to reduce poverty and promote the sharing of prosperity in developing countries. Four of its governance indicators are used here. Government Effectiveness captures perceptions of the quality of public services, the quality of the civil service and the degree of its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government’s commitment to such policies. Regulatory Quality captures perceptions of the ability of the government to formulate and implement sound policies and regulations that permit and promote private sector development. Rule of Law captures perceptions of the extent to which agents have confidence in and abide by the rules of society, and in particular the quality of contract enforcement, property rights, the police, and the courts, as well as the likelihood of crime and violence. Voice and Accountability captures perceptions of the extent to which a country’s citizens are able to participate in selecting their government, as well as freedom of expression, freedom of association, and a free media. The score for each variable varies between -2.5 and 2.5. Table 1 presents the names and descriptive statistics (including the Shapiro-Wilk p-value of the normality test) of all variables included in this analytics study. In general, authors have shown (Delen et al., 2012; Sharda et al., 2017) that machine learning algorithms are able to extract deeply hidden knowledge from large datasets involving several types of input variables that are not necessarily normally distributed.
Unlike traditional stochastic data models, the machine learning community assumes that, in nature, data are generated in complex ways that are not necessarily normally distributed or linearly correlated. The only assumption of algorithmic modeling is that data generated by natural processes follow an unknown multivariate distribution (Breiman, 2001). Our research contributes to the literature on corruption by using state-of-the-art machine learning techniques to discover the most important predictors of perceived corruption in economies. Fig 1 uses average CPI scores from the year 2019 to create a color scale (from red to blue) to illustrate and differentiate high and low levels of corruption across countries.

Fig. 1. Corruption Perception Index across countries. Dark yellow countries (higher CPI scores) relate to less corrupt economies while dark red countries (low CPI scores) relate to more corrupt economies.

3.   Machine learning algorithms

The purpose of machine learning is to analyze the information contained in large amounts of data in order to learn how to perform a task without being explicitly programmed for it beforehand. It has had great success in recent years. These flexible techniques have the potential to capture complex nonlinear, interactive selection patterns, yet to our knowledge their performance in analyzing missing context data has never been evaluated. Poor forecasts and the associated model uncertainties can increase construction costs[10]. These costs can be reduced by predicting the stability number closer to actual conditions[11]. A meaningful, robust model can therefore be both cost-effective and time-saving. For this reason, advanced machine learning algorithms have been studied to overcome the limitations mentioned above.

3.1. Random Forest

Random forest is an efficient and widely used tree-based ensemble learning method. Conceptually, random forest algorithms rely on bagging[12]: they sample the training dataset with replacement and grow multiple decision trees. The final result of the random forest model (RFM) is the decision supported by the largest number of trees[13]. This reduces the prediction variance relative to an individual tree, so the model generalizes better. First, a bootstrapped dataset is created by drawing random observations from the main dataset, and a decision tree is built on it. Other trees are developed in the same way. Specifically, at each split a tree considers only a randomly chosen subset of the predictors as candidate split variables, and the best split among them is selected. This randomness decreases the variance of the forest estimate. An individual decision tree tends to overfit and thus has high variance in its predictions. Random forest reduces this variance through its wide variety of trees and achieves high accuracy by combining single trees and averaging their predictions. Out-of-bag observations, which were not drawn in the sampling process, are used to estimate the error rate[14]. The random forest model predicts well for both numerical and categorical variables, and it can estimate missing values in the dataset. Moreover, it can handle complex interactions and noisy variables. The schematic idea of the random forest algorithm is shown in Fig. 2.

Fig. 2. Schematic of the random forest algorithm
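The bagging mechanism described above (bootstrap sampling, averaging over trees, and out-of-bag error) can be sketched as follows. This is a minimal illustration, not the implementation used in the study: each "tree" is reduced to a one-split regression stump on a single predictor, so the random choice of predictors at each split is omitted, and the toy data and all function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """One-split regression tree on a single predictor (stand-in for a full tree)."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                      # all x identical: predict the mean
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def bagged_forest(xs, ys, n_trees=50, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample; remember out-of-bag rows."""
    rng = random.Random(seed)
    n = len(xs)
    trees, oob_sets = [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # sampling with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
        oob_sets.append(set(range(n)) - set(idx))       # rows not drawn this time
    return trees, oob_sets

def predict(trees, x):
    """Forest prediction: average of the individual tree predictions."""
    return mean(tree(x) for tree in trees)

def oob_mse(trees, oob_sets, xs, ys):
    """Error estimate using, for each row, only trees that never saw that row."""
    errors = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        votes = [tree(x) for tree, oob in zip(trees, oob_sets) if i in oob]
        if votes:
            errors.append((y - mean(votes)) ** 2)
    return mean(errors)

# Hypothetical toy data: a step function of one predictor.
xs = list(range(20))
ys = [0.0 if x < 10 else 10.0 for x in xs]
trees, oob_sets = bagged_forest(xs, ys)
```

Calling `oob_mse(trees, oob_sets, xs, ys)` then gives an error estimate without setting aside a separate test set.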

3.2.   Gradient Boosting

Gradient boosting is a generalized tree boosting procedure that accommodates arbitrary differentiable loss functions. It has been recognized as an effective machine learning technique for classification and regression problems. The regression model is used to predict continuous values[15]. The gradient boosting regressor uses a number of decision trees of fixed size and builds an additive model. This is the main difference between the gradient boosting regressor and the conventional AdaBoost model, which usually uses decision stumps with one node and two leaves. The model fitting procedure begins with a single leaf holding the mean value of the target variable. A tree is then fitted to the resulting residuals, and its contribution is scaled by a learning rate before being added to the current estimate. Further trees are added in the same way, each fitted to the new residuals, that is, to the error left by the previous trees. In this way, the decision trees are fitted to the negative gradient of the loss at each sample, and the gradients are updated at each iteration for the next estimator. The schematic idea of the gradient boosting regression algorithm is illustrated in Fig. 3.

Fig. 3. A schematic of the gradient boosting regression algorithms
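The fitting loop described in this subsection can be sketched for the squared-error loss, where the negative gradient is simply the residual. This is an illustrative sketch under simplifying assumptions: each tree is a one-split stump on a single predictor, and the function names, the default hyperparameters and the toy data are hypothetical.

```python
from statistics import mean

def fit_stump(xs, targets):
    """One-split regression tree on a single predictor, chosen by squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, targets) if x < t]
        right = [y for x, y in zip(xs, targets) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                      # all x identical: predict the mean
        m = mean(targets)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def gradient_boost(xs, ys, n_trees=100, learning_rate=0.1):
    """Additive model: initial mean plus scaled trees fitted to successive residuals."""
    base = mean(ys)                       # initial leaf: mean of the target
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # Residuals are the negative gradient of 0.5 * (y - prediction)^2.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Scale the new tree's contribution by the learning rate.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

# Hypothetical toy data: a nonlinear relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)
```

A smaller learning rate makes each tree contribute less, so more trees are needed, but the fit tends to be smoother.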

3.3.    CART (Classification And Regression Trees)

CART was the first model to use regression trees for machine learning. It is very simple and does not perform very well on its own; however, it serves as the basis for several more elaborate machine learning models, such as bagging and random forests.

* Maximum tree: If the decision tree is a regression tree, the variable to be explained is continuous, and we want to know the value of a quantity of interest. We write:

– $Y$: the response variable.

– $p$: the number of covariates.

– $X_1, \dots, X_p$: the covariates.

– $E[Y \mid X = x]$: the quantity of interest to be predicted.

Since the expectation is the solution of the minimization of the quadratic error, the quantity of interest is the solution of the following equation:

$$E[Y \mid X = x] = \arg\min_{c \in \mathbb{R}} E\left[(Y - c)^2 \mid X = x\right]$$

With: $X = (X_1, \dots, X_p)$.

To build the tree, the goal is to segment each node into two child nodes, minimizing the variance of the two new nodes. At each constructed node, the new estimator of E[Y] becomes the empirical expectation of all observations of the node.
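The node-splitting rule just described can be sketched as an exhaustive search over covariates and thresholds. Minimizing the summed squared error of the two child nodes is equivalent to minimizing their size-weighted variance. The function names and the four-row toy dataset are hypothetical.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the node mean (variance times node size)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return (covariate index, threshold, total child SSE) of the best binary split."""
    best = None
    for j in range(len(rows[0])):
        # Candidate thresholds: every observed value except the smallest,
        # so that both child nodes are non-empty.
        for t in sorted({r[j] for r in rows})[1:]:
            left = [y for r, y in zip(rows, targets) if r[j] < t]
            right = [y for r, y in zip(rows, targets) if r[j] >= t]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical toy data: two covariates, one response.
rows = [[1, 5], [2, 3], [3, 8], [4, 1]]
targets = [10, 10, 20, 20]
split = best_split(rows, targets)   # splits on covariate 0 at threshold 3

# The estimate in each child node is the empirical mean of its observations.
j, t, _ = split
left_mean = mean(y for r, y in zip(rows, targets) if r[j] < t)
right_mean = mean(y for r, y in zip(rows, targets) if r[j] >= t)
```

Applying `best_split` recursively to each child node until a stopping rule is met grows the maximum tree.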

3.4.    Testing and evaluating – cross-validation

For these methods, we used k-fold cross-validation, which randomly divides the data into k mutually exclusive subsets (folds) that serve as “training” and “testing” sets. In each round, k-1 folds are used to build the model and the remaining fold is used to test it. Delen et al. (2012) showed that a single random assignment can lead to heterogeneous subsets of data which, in turn, would produce biased results. For this reason, we used five rounds (k = 5) of cross-validation on the entire dataset. In each round of the 5-fold cross-validation, the model is trained on all folds except one and tested on the excluded fold, which is the test subset for that round. Finally, the results of the five rounds are averaged for the final analysis. Olson and Delen (2008) report that stratified cross-validation tends to decrease bias compared to regular cross-validation. According to Delen et al. (2012), overall accuracy is measured as the average of the k individual accuracy measurements.
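The 5-fold procedure can be sketched as follows. The `fit` and `error` arguments are placeholders for any model-fitting and scoring functions and are not part of the original study; the fold assignment shown is a simple shuffled round-robin, which is one of several reasonable choices.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # random assignment to folds
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(fit, error, xs, ys, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k errors."""
    errors = []
    for train, test in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(error(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(errors) / k

# 34 observations, as in Table 1, split into 5 folds.
fold_sizes = [len(test) for _, test in k_fold_indices(34, k=5)]  # [7, 7, 7, 7, 6]
```

Every observation appears in exactly one test fold, so each one contributes once to the averaged error.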

3.5.  Model Performance Comparison

To evaluate and compare the prediction performance of the three machine learning algorithms, we use the mean squared error (MSE), defined as the average, over the individuals in the test set, of the squared deviation between the prediction of the model being tested and the true output value. For each model $j$, we evaluate the following quantity:

$$\mathrm{MSE}_j = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(y_i - \hat{y}_{i,j}\right)^2, \qquad j = 1, \dots, m$$

Where:

– $m$: the number of models to test.

– $n$: the number of individuals in the initial database.

– $n_{\text{test}}$: the number of individuals in the test database.

– $y_i$: the actual output variable of individual $i$.

– $\hat{y}_{i,j}$: the output variable of individual $i$ predicted by model $j$.

We are interested in the model that attains $\min_{1 \le j \le m} \mathrm{MSE}_j$. The model that minimizes the MSE is the one that best predicts the dependent variable.
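The comparison rule above can be sketched in a few lines. The model names and all numbers are hypothetical placeholders and do not reproduce the study's results.

```python
def mse(y_true, y_pred):
    """Mean squared error between actual and predicted outputs on the test set."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical test-set outputs and predictions from m = 3 candidate models.
y_test = [40.0, 55.0, 28.0, 33.0]
predictions = {
    "model_a": [41.0, 54.0, 29.0, 33.0],
    "model_b": [45.0, 50.0, 30.0, 36.0],
    "model_c": [38.0, 60.0, 20.0, 35.0],
}

# One MSE per model; the preferred model is the one with the smallest value.
scores = {name: mse(y_test, y_hat) for name, y_hat in predictions.items()}
best_model = min(scores, key=scores.get)
```

Because MSE squares each deviation, a few large errors weigh more heavily than many small ones when models are ranked this way.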

  4. Results and discussions

The objective is to build artificial intelligence models to predict corruption. To achieve this, we used three machine learning approaches: Random Forest (RF), Gradient Boosting (GB) and CART. The data were split so that 70% was used to train the models and 30% to examine their efficiency and accuracy. To compare the performance of the models, we use the mean squared error (MSE); in general, the smaller the error, the more accurate the model. Table 2 shows that Random Forest gives the most accurate predictions, achieving an overall accuracy of 99.66%. Gradient Boosting (GB) was the second-best model with an overall accuracy of 93.79%, followed by CART with 92.30%. Random Forest remains a “black box” method, since no single decision tree can be visualized to show how the final prediction is obtained. This type of model gives great predictive performance, is versatile and detects interactions without having to specify a parametric form (Hamza and Larocque, 2005). Fig. 4 illustrates how the Random Forest algorithm avoids overfitting: generally speaking, the root mean square error on both the training and test data decreases. To assess the most influential factors on corruption in the MENA region, we use the random forest-based predictor importance technique.

Fig 4. Summary of random forest response.

Fig 5. Importance of variables

Understanding the nature of the data is one of the most important goals in machine learning (ML). This study presents an approach to estimating the relative influence of the input features on corruption by means of feature analysis based on the proposed ensemble learning models. Erdik[16] noted that the success of such forecasting models can be increased by a deeper understanding of the stability parameters. Feature selection is a procedure that identifies a subset of the original input variables. It is very effective for building a good prediction model and for understanding the knowledge contained in the dataset. To the best of the authors’ knowledge, no strong and compact feature analysis has been reported in the existing literature on corruption. A good feature importance analysis has many benefits. Feature selection reduces the dimensionality of the prediction problem in a principled way, which speeds up the ML algorithms, reduces storage requirements and improves the accuracy of the prediction models. Identifying the most relevant variables also helps to understand the underlying relations between the corruption variables in the dataset. Feature analysis therefore provides deep insight into the dataset. In addition, the easiest way to meet storage requirements and achieve the required speed is to keep only the variables that matter most for the response variable. Incorporating variable importances into a dedicated iterative feature selection procedure can yield more accurate predictions at the cost of a little computational effort. Trading across Borders (100%), Time – Men (days, 94%), Getting Electricity (85%) and Cost (% of warehouse value, 84%) are the most influential factors in determining the level of corruption.
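One common way to obtain importance scores of this kind is permutation importance: shuffle one predictor at a time and record how much the prediction error grows, then scale the scores so that the top predictor equals 100%. The text does not state which importance measure was used, so the sketch below is an assumption for illustration only; the model, the data and the function names are hypothetical.

```python
import random

def mse(model, rows, ys):
    """Mean squared error of a row-wise prediction function."""
    return sum((y - model(r)) ** 2 for r, y in zip(rows, ys)) / len(ys)

def permutation_importance(model, rows, ys, seed=0):
    """Relative importance (top predictor = 100) from the error increase
    observed when each predictor column is shuffled in turn."""
    rng = random.Random(seed)
    base = mse(model, rows, ys)
    raw = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)                       # break the link with the response
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        raw.append(max(0.0, mse(model, shuffled, ys) - base))
    top = max(raw)
    return [100.0 * (v / top) if top > 0 else 0.0 for v in raw]

# Hypothetical fitted model that uses only the first of two predictors.
model = lambda r: 3 * r[0]
rows = [[i, (7 * i) % 10] for i in range(10)]
ys = [3 * i for i in range(10)]
importance = permutation_importance(model, rows, ys)
```

In this toy case the first predictor receives the full score and the second receives none, since shuffling a predictor the model ignores cannot change the error.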

Conclusions

In this study, we examined several potential predictors of the Corruption Perceptions Index in 17 MENA countries. We used variables provided by the Ease of Doing Business project, Transparency International, the United Nations Development Programme and the World Bank for the years 2019 and 2020. After several experiments, we retained three methods: Gradient Boosting, Classification and Regression Trees (CART) and Random Forest. The cross-validation results showed that Random Forest was the most accurate method for predicting the CPI. Gradient Boosting and CART were, respectively, the second and third best models, both showing satisfactory prediction performance. Our results indicate that Trading across Borders is the most important predictor of corruption. Moreover, machine learning models, and ensemble-type algorithms such as Random Forest in particular, are capable of revealing important predictors regardless of significant linear correlations and complex relationships between the input variables.

Most machine learning algorithms are considered predictive instruments with reduced descriptive capabilities. Although the predictive accuracy of machine learning models has been consistently higher than that of traditional regression models, their operation has been referred to as a “black box” by some researchers, because a machine learning model is trained under the assumption that the data are generated in a complex way and need not be linearly correlated to produce accurate predictions of a certain outcome. In other words, in machine learning, the relationship between input and output variables can only be inferred by heuristic methods of experimentation, with an emphasis on predictive accuracy.

Customs and trade regulations appear to be greater obstacles for businesses in the MENA region than elsewhere. Businesses need more time to clear customs to import or export than in other countries. The MENA region depends on high levels of imports compared to low export activities. To ensure sustainable growth in the region’s private sector, the report calls on the MENA region to lower regulatory barriers for business, promote competition, and reduce disincentives resulting from political influence and informal business practices. The region also needs reforms to facilitate innovation, the adoption of digital technologies and investments in human capital, in line with the global agenda to limit climate change, enhance sustainability and protect the natural environment. Finally, the lack of explicit theoretical explanations regarding the relationships between variables (strength and direction of influence) can be considered one of the limitations of predictive analytics studies, including the one proposed in this paper.

Table 1. List of variables and their descriptive statistics

| Variables | Observations | Mean | Min | Max | Std Dev | Median | W | Prob < W |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Corruption | 34 | 41.0000 | 15.0000 | 71.000 | 16.2033 | 42.0000 | 0.96197 | 0.27737 |
| Education Index | 34 | 0.7074 | 0.3500 | 0.919 | 0.1260 | 0.7210 | 0.96974 | 0.45424 |
| Starting a Business | 34 | 84.0971 | 67.0000 | 94.800 | 8.5421 | 85.4000 | 0.91986 | 0.1601 |
| Procedure – Men (number) | 34 | 6.1176 | 2.0000 | 12.000 | 2.8473 | 6.0000 | 0.93675 | <0.04937 |
| Time – Men (days) | 34 | 18.3529 | 3.5000 | 72.000 | 16.8853 | 12.0000 | 0.72928 | <0.000 |
| Cost – Men (% of income per capita) | 34 | 14.1882 | 1.0000 | 42.300 | 14.0317 | 6.3000 | 0.82734 | <0.0000 |
| Procedure – Women (number) | 34 | 6.7647 | 3.0000 | 12.000 | 2.8183 | 7.0000 | 0.93553 | <0.04544 |
| Time – Women (days) | 34 | 19.0000 | 4.5000 | 73.000 | 16.9393 | 13.0000 | 0.72231 | <0.0000 |
| Cost – Women (% of income per capita) | 34 | 14.1882 | 1.0000 | 42.300 | 14.0317 | 6.3000 | 0.82734 | <0.0000 |
| Paid-in min. capital (% of income per capita) | 34 | 5.2412 | 0.0000 | 41.500 | 12.0021 | 0.0000 | 0.49870 | <0.000 |
| Dealing with Construction Permits | 34 | 64.4294 | 0.0000 | 89.800 | 25.4309 | 71.5500 | 0.69765 | <0.000 |
| Procedures (number) | 34 | 14.7941 | 9.0000 | 22.000 | 4.0585 | 14.0000 | 0.91924 | <0.01537 |
| Time (days) | 34 | 120.5588 | 47.5000 | 276.000 | 58.0771 | 114.0000 | 0.89920 | <0.00437 |
| Cost (% of warehouse value) | 34 | 3.5441 | 0.1000 | 12.100 | 3.1448 | 3.3000 | 0.94144 | 0.06807 |
| Building quality control index (0-15) | 34 | 11.9706 | 4.5000 | 15.000 | 2.7577 | 12.5000 | 0.92711 | 0.0258 |
| Getting Electricity | 34 | 72.3500 | 0.0000 | 100.000 | 21.0710 | 76.5500 | 0.88526 | <0.00191 |
| Procedures (number) | 34 | 4.4706 | 2.0000 | 6.000 | 1.1074 | 5.0000 | 0.93024 | <0.003179 |
| Time (days) | 34 | 57.5294 | 7.0000 | 118.000 | 29.3611 | 53.0000 | 0.90064 | <0.0477 |
| Cost (% of income per capita) | 34 | 308.1765 | 0.0000 | 1308.800 | 388.3218 | 128.0000 | 0.88974 | 0.00248 |
| Reliability of supply and transparency of tariff index (0-8) | 34 | 5.0000 | 0.0000 | 8.000 | 2.4863 | 6.0000 | 0.89125 | <0.00272 |
| Registering Property | 34 | 64.9647 | 0.0000 | 96.200 | 20.5760 | 66.5000 | 0.95631 | 0.18944 |
| Procedures (number) | 34 | 5.4412 | 1.0000 | 10.000 | 2.6651 | 6.0000 | 0.85941 | <0.00046 |
| Time (days) | 34 | 25.7353 | 1.0000 | 76.000 | 20.7201 | 19.5000 | 0.73077 | <0.0000 |
| Cost (% of property value) | 34 | 4.1706 | 0.0000 | 9.000 | 3.0564 | 6.0000 | 0.89994 | <0.00457 |
| Quality of the land administration index (0-30) | 34 | 16.1471 | 7.0000 | 26.000 | 5.3563 | 17.0000 | 0.84766 | <0.00030 |
| Getting Credit | 34 | 41.1559 | 0.0000 | 95.000 | 24.2137 | 45.0000 | 0.86447 | <0.01796 |
| Strength of legal rights index (0-12) | 34 | 2.8824 | 0.0000 | 11.000 | 2.8044 | 2.0000 | 0.84925 | <0.01044 |
| Depth of credit information index (0-8) | 34 | 5.7647 | 0.0000 | 8.000 | 3.3126 | 8.0000 | 0.64990 | <0.00014 |
| Credit registry coverage (% of adults) | 34 | 13.5235 | 0.0000 | 60.300 | 17.0603 | 5.0000 | 0.83672 | <0.0000 |
| Credit bureau coverage (% of adults) | 34 | 25.1176 | 0.0000 | 100.000 | 29.1078 | 22.9000 | 0.64339 | <0.0001 |
| Protecting Minority Investors | 34 | 51.8235 | 18.0000 | 86.000 | 20.1050 | 54.0000 | 0.93463 | <0.0005 |
| Extent of disclosure index (0-10) | 34 | 6.4706 | 2.0000 | 10.000 | 2.2861 | 7.0000 | 0.78761 | <0.02580 |
| Extent of director liability index (0-10) | 34 | 4.7059 | 1.0000 | 10.000 | 3.0104 | 4.0000 | 0.81267 | <0.00191 |
| Ease of shareholder suits index (0-10) | 34 | 4.4118 | 1.0000 | 9.000 | 1.9403 | 4.0000 | 0.94144 | <0.03179 |
| Extent of shareholder rights index (0-6) | 34 | 3.2941 | 0.0000 | 6.000 | 1.9311 | 4.0000 | 0.92711 | 0.10382 |
| Extent of ownership and control index (0-7) | 34 | 3.8235 | 0.0000 | 7.000 | 2.4429 | 4.0000 | 0.88526 | 0.06560 |
| Extent of corporate transparency index (0-7) | 34 | 3.7647 | 0.0000 | 7.000 | 2.4502 | 4.0000 | 0.93024 | 0.06951 |
| Paying Taxes | 34 | 75.0559 | 40.0000 | 100.000 | 16.2081 | 74.1000 | 0.90064 | 0.29476 |
| Payments (number per year) | 34 | 14.3529 | 3.0000 | 44.000 | 10.8149 | 12.0000 | 0.88974 | <0.02337 |
| Time (hours per year) | 34 | 203.9647 | 10.4000 | 889.000 | 201.1199 | 155.0000 | 0.89125 | <0.00038 |
| Total tax and contribution rate (% of profit) | 34 | 30.8765 | 11.3000 | 66.100 | 16.2346 | 27.4000 | 0.95631 | <0.0046 |
| Postfiling index (0-100) | 33 | 54.7455 | 19.0000 | 98.600 | 28.6773 | 49.8000 | 0.84766 | <0.0003 |
| Trading across Borders | 34 | 61.0647 | 0.0000 | 85.600 | 22.9799 | 68.8500 | 0.85951 | <0.00046 |
| Time to export: Border compliance (hours) | 33 | 54.3333 | 6.0000 | 101.000 | 30.2713 | 53.0000 | 0.93129 | <0.03810 |
| Cost to export: Border compliance (USD) | 33 | 358.2424 | 47.0000 | 1118.000 | 267.5850 | 319.0000 | 0.87663 | <0.00139 |
| Time to export: Documentary compliance (hours) | 33 | 65.4545 | 3.0000 | 504.000 | 119.6378 | 24.0000 | 0.51337 | <0.0000 |
| Cost to export: Documentary compliance (USD) | 33 | 227.7576 | 50.0000 | 1800.000 | 413.3370 | 100.0000 | 0.40450 | <0.0000 |
| Time to import: Border compliance (hours) | 33 | 97.5152 | 39.0000 | 240.000 | 61.3505 | 72.0000 | 0.80825 | <0.0005 |
| Cost to import: Border compliance (USD) | 33 | 486.7273 | 206.0000 | 790.000 | 174.2753 | 553.0000 | 0.92361 | <0.02310 |
| Time to import: Documentary compliance (hours) | 33 | 73.0909 | 7.0000 | 265.000 | 64.0568 | 60.0000 | 0.78742 | <0.0002 |
| Cost to import: Documentary compliance (USD) | 33 | 254.3636 | 60.0000 | 1000.000 | 228.3422 | 144.0000 | 0.72430 | <0.000 |
| Enforcing Contracts | 34 | 56.5912 | 40.0000 | 75.900 | 8.1126 | 57.7500 | 0.95425 | 0.16454 |
| Time (days) | 34 | 605.7059 | 5.0000 | 1010.000 | 210.9447 | 598.0000 | 0.83153 | <0.00011 |
| Cost (% of claim value) | 34 | 37.2176 | 14.7000 | 248.000 | 53.7263 | 26.2000 | 0.33789 | <0.0000 |
| Quality of judicial processes index (0-18) | 34 | 7.1765 | 1.5000 | 14.000 | 3.4198 | 6.5000 | 0.93427 | <0.04171 |
| Resolving Insolvency | 34 | 35.8206 | 0.0000 | 72.800 | 20.1564 | 39.2500 | 0.90724 | <0.00716 |
| Recovery rate (cents on the dollar) | 34 | 37.7412 | 15.0000 | 62.600 | 12.8168 | 34.1500 | 0.93678 | <0.04949 |
| Time (years) | 34 | 4.3029 | 1.0000 | 27.700 | 6.2199 | 2.9000 | 0.4449 | <0.0000 |
| Cost (% of estate) | 34 | 12.9529 | 3.2000 | 23.000 | 6.3711 | 10.0000 | 0.88443 | <0.00182 |
| Outcome (0 as piecemeal sale and 1 as going concern) | 34 | 1.2353 | 0.0000 | 20.000 | 4.7677 | 0.0000 | 0.27292 | <0.0000 |
| Strength of insolvency framework index (0-16) | 34 | 8.6029 | 4.0000 | 12.500 | 2.7074 | 8.2500 | 0.91104 | <0.00910 |
| Rule of law | 34 | -0.2342 | -2.0000 | 1.048 | 0.9428 | 0.0812 | 0.91586 | <0.01236 |
| Regulatory quality | 34 | -0.3249 | -2.3471 | 1.281 | 1.0159 | -0.0841 | 0.95745 | 0.20474 |
| Government effectiveness | 34 | -0.2647 | -2.3000 | 1.377 | 0.9857 | -0.1112 | 0.95696 | 0.19795 |
| Voice and accountability | 34 | -0.9384 | -1.8000 | 0.700 | 0.6383 | -1.1135 | 0.87713 | <0.00120 |

Note. The Prob < W value listed in the last column is the p-value of the Shapiro-Wilk test, whose null hypothesis is that the data are normally distributed.
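The decision rule behind the note can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the 5% significance level (`ALPHA`) is an assumption, since the paper does not state one, and the dictionary holds a small hand-copied subset of the Table 1 p-values. Entries reported as "<p" are stored as their upper bound, which is enough to decide rejection whenever that bound is already below the threshold.

```python
# Illustrative subset of the Shapiro-Wilk p-values reported in Table 1.
# Values printed as "<p" in the table are stored as their upper bound p.
reported_p_values = {
    "Corruption": 0.27737,
    "Education Index": 0.45424,
    "Procedure - Men (number)": 0.04937,   # reported as <0.04937
    "Time - Men (days)": 0.000,            # reported as <0.000
    "Cost (% of warehouse value)": 0.06807,
}

ALPHA = 0.05  # assumed significance level; not stated in the paper


def rejects_normality(p_value, alpha=ALPHA):
    """Return True when the null hypothesis of normality is rejected."""
    return p_value < alpha


# Variables whose distribution departs significantly from normality.
non_normal = sorted(
    name for name, p in reported_p_values.items() if rejects_normality(p)
)

# Variables for which normality cannot be rejected at the assumed level.
not_rejected = sorted(
    name for name in reported_p_values if name not in non_normal
)

print("Normality rejected:", non_normal)
print("Normality not rejected:", not_rejected)
```

Under this assumed threshold, most of the regulatory indicators in Table 1 fall in the first group, which is consistent with the paper's preference for tree-based models that do not require normally distributed inputs.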

Table 2: Performance analysis of the ensemble-learning-based corruption prediction models.

| Model | MSE |
| --- | --- |
| Random Forest | 0.003413 |
| Gradient Boosting | 0.062134 |
| CART | 0.070208 |
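As a reading aid for Table 2, the following minimal Python sketch shows how a mean squared error (MSE) is computed and how the three reported values rank the models. The function name `mean_squared_error` and the toy vectors are illustrative assumptions, not the authors' code or data; only the three MSE figures come from the table.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# MSE values as reported in Table 2.
reported_mse = {
    "Random Forest": 0.003413,
    "Gradient Boosting": 0.062134,
    "CART": 0.070208,
}

# Lower MSE means better predictive accuracy, so sort in ascending order.
ranking = sorted(reported_mse, key=reported_mse.get)

# Hypothetical observed and predicted scores, used only to exercise the formula.
toy_observed = [0.41, 0.15, 0.71]
toy_predicted = [0.40, 0.20, 0.70]
toy_mse = mean_squared_error(toy_observed, toy_predicted)

print(ranking)
print(round(toy_mse, 6))
```

The ranking reproduces the ordering stated in the abstract: Random Forest first, followed by Gradient Boosting and CART.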

References

  1. Aghion, P., Akcigit, U., Cagé, J., & Kerr, W. (2016). Taxation, corruption, and growth. NBER Working Paper 21928. Cambridge, Massachusetts: National Bureau of Economic Research.
  2. Al-Sadig, A. (2010). Corruption and private domestic investment: Evidence from developing countries. International Journal of Economic Policy in Emerging Economies, 3(1), 47–60. https://doi.org/10.1504/IJEPEE.2010.032794.
  3. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324.
  4. Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees.
  5. Besley, T., & Persson, T. (2014). Why do developing countries tax so little? Journal of Economic Perspectives, 28(4), 99–120.

Review, 99, 5–24. https://doi.org/10.1257/aer.99.1.5.

  6. Friedman, J., Hastie, T., & Tibshirani, R. (2000). Annals of Statistics, 28(2), 337–407.
  7. Graeff, P., & Mehlkop, G. (2003). The impact of economic freedom on corruption: Different patterns for rich and poor countries. European Journal of Political Economy, 19, 605–620. https://doi.org/10.1016/S0176-2680(03)00015-6.
  8. Godinez, J. R., & Liu, L. (2015). Corruption distance and FDI flows into Latin America. International Business Review, 24(1), 33–42.
  9. Hamza, M., & Larocque, D. (2005). An empirical comparison of ensemble methods based on classification trees. Journal of Statistical Computation and Simulation, 75(8), 629–643. https://doi.org/10.1080/00949650410001729472.
  10. Hough, D. (2017). Analysing corruption. Newcastle upon Tyne: Agenda. ISBN 9781911116547.
  11. Miller, S. (2011). Corruption. In E. N. Zalta (Ed.), Stanford encyclopedia of philosophy. http://plato.stanford.edu/archives/spr2011/entries/corruption/.
  12. Mungiu-Pippidi, A. (2015a). Corruption: Good governance powers innovation. Nature News, 518(7539), 295.
  13. Lima, M. S. M., & Delen, D. (2020). Predicting and explaining corruption across countries: A machine learning approach. Government Information Quarterly, 37(1), 101407.
  14. Nuijten, M., & Anders, G. (2017). Corruption and the secret of law: An introduction. In Corruption and the secret of law (pp. 1–24). New York: Routledge.
  15. Ribeiro, H. V., Alves, L. G., Martins, A. F., Lenzi, E. K., & Perc, M. (2018). The dynamical structure of political corruption networks. Journal of Complex Networks, 6, 989–1003. https://doi.org/10.1093/comnet/cny002.

Annex 1

Algeria

Yemen

Bahrain

Egypt

Iran

Iraq

Israel

Jordan

Kuwait

Libya

Morocco

Oman

Qatar

Saudi Arabia

Tunisia

United Arab Emirates

Lebanon


[1] LaREMFiC Laboratory, IHEC, University of Sousse, Tunisia. / E-mail: nadiafarjallah25@gmail.com, Country: Tunisia, City: Monastir.

[2] EconomiX Laboratory, Nanterre, France. / E-mail: naima_zaouali@yahoo.fr, City: Monastir.

[3] Banque mondiale, 2002. Rapport sur le développement dans le monde 2002 : des institutions pour les marchés. Banque mondiale, Washington, DC, Éditions Eska, Paris.

[4] www.doingbusiness.org

[5] www.transparency.org

[6] www.undp.org

[7] www.worldbank.org

[8] hdr.undp.org/en/indicators/103706.

[9] www.doingbusiness.org/methodology

[10] D.H. Kim, W.S. Park, Ocean Eng. 32 (2005) 1332–1349.

[11] T. Erdik, Expert Syst. Applic. 36 (2009) 4162–4170.

[12] G. Louppe, arXiv:1407.7502v3 (2015).

[13] L. Breiman, Mach. Learn. 24 (2) (1996) 123–140

[14] L. Breiman, Mach. Learn. 24 (2) (1996) 123–140.

[15] J.P. Bieman, J.M. Wilms, H. Boogaard, Water (Basel) 12 (6) (2020). 2073-4441(1703)

[16] T. Erdik, Expert Syst. Applic. 36 (2009) 4162–4170.
