
NATIONAL RESEARCH UNIVERSITY

HIGHER SCHOOL OF ECONOMICS

DIPLOMA

The Application of GPS-trackers Data to the Analysis of Spatial Behavior Patterns of Drivers

Angelina Shkrebets

Research Adviser:

Ivan Stankevich

MOSCOW 2019

Abstract

The paper analyses driving behavior and its possible applications to the tariff policy of insurance companies. A price tariff for a driver is designed according to his/her riskiness. Nowadays there are many opportunities to measure a driver's probability of getting into an accident. The research provides a modern approach that uses real-time data on driving patterns to make the existing system more flexible and transparent.

This can be achieved by evaluating data from GPS trackers combined with behavioral characteristics of each driver: for instance, his/her reaction when passing traffic lights, speed bumps, etc. The main idea is to measure driving style as the driver's behavior before and after passing particular transport objects. The analysis of this metric can be carried out with machine learning algorithms, which have proved successful in the identification of car accidents. As a result, the scoring process will become more manageable and adequate.

Introduction

Car accidents are among the most frequent causes of human mortality: the death toll reached 1.35 million according to the Global Status Report on Road Safety (2018) [17]. Moreover, road injuries are the leading cause of death among people aged 5-29 years. Beyond human lives, accidents levy substantial economic costs, being tied up with casualties, property damage and extra medical resources.

An agent that may smooth the drastic effects of such situations is the insurance industry. Nowadays the majority of insurance companies gather a comprehensive profile of their customers and determine insurance tariffs relying on a long list of parameters, including age, occupation, psychological assessment, marital status, car model, price of the car, driving history and sometimes credit history (if the insurance company is connected with a bank). Despite the fact that this procedure produces a complete portrait of a client, it still fails to evaluate the probability of getting into an accident.

The worldwide successful turn towards data science and deep learning across various industries suggests that driver scoring models should be substantially altered by incorporating historical and real-time information with the help of advanced machine learning techniques, which can disclose entangled patterns in data and thus predict accidents. All in all, the analysis of features which cause accidents and their monitoring will make it possible to forecast the risk profiles of individual drivers and remold insurance tariffs on a weekly or daily basis to prevent tragic situations on the roads. It will also alleviate the moral hazard problem, fully described in the well-known research by Akerlof (1970) [18], which can be extrapolated to the insurance business in the following way: companies are forced to impose high prices because clients tend to drive more aggressively after buying insurance, so less prosperous drivers cannot afford it. On the one hand, this is favorable for social welfare, because without complete coverage an individual may avoid risky situations on the road. On the other hand, it is an inadvertent forfeiture of feasible income. A flexible system relying on tracking data, which allows an individual tariff to be determined for each driver according to his/her driving style and thus risk ranking, will resolve the existing problem of a missed market share and extend total revenue, as additional sources of cash flow will be generated.

The major idea behind this paper is to create a model based on data from GPS trackers that will help to estimate an appropriate tariff for a driver depending on his/her driving behavior and risk appetite. This study will also cover a second-order problem concerning the influence external objects have on the driving style of individuals. Such objects are represented by traffic lights, speed bumps and other transport policy tools. To proceed with the idea, Open Street Map will be parsed to obtain traffic objects, and driving patterns near them will then be extracted. Moreover, one of the intermediate steps is to identify the key features which influence driving style and must be included in the final decision model. Afterwards, the decision algorithms best suited to processing the dataset will be selected.

Since the designed algorithm fully depends on data that must be in a certain format, it cannot be applied to every car tracker. It is possible to convert the data into one format; however, this is time-consuming and should be performed in a unified way to show the desired results. The model can be implemented when analyzing car sharing, taxi services, and private cars. The latter can be examined only through aggregated data, yielding essential statistics and clusters of drivers, due to confidentiality concerns. Furthermore, all observations are made in Russia, which is why the conclusions can hardly be extrapolated to other countries.

1. Insurance Business & Literature Review

Nowadays there are two kinds of insurance programs offered to clients: `at fault' and `no fault'. The former indicates that the insurance company of the party culpable for a crash is in charge of reimbursing the damage. The latter denotes reimbursement of the damage without splitting the parties into culprit and victim. In many countries, including Russia, the first type of program is mandatory for every driver, traded at a constant price and controlled by the authorities, while the second type is optional, proposed by agencies and traded at a market price. In comparison, the US insurance market is organized in a completely different way: in some states only the first type of program operates, while in others only the second. However, both of them furnish flexible variations for individual contracts and consist of various components: Auto Liability Coverage, Uninsured and Underinsured Motorist Coverage, Comprehensive Coverage, Collision Coverage, Medical Payments Coverage and Personal Injury Protection. Furthermore, there are options to pay less for the contract through discounts and deductibles [19].

The insurance business will thrive if companies focus on the Internet of Things. Venture investors started nourishing the industry by investing almost $4.3 billion in insurtechs between 2015 and 2016, which is approximately 5 times more than in previous years. However, the authors claim, the arising profits will not be divided equally between companies: a few players will get the largest share of the profit, while the others might face huge losses, as happened in Germany, Spain and the US [Figure 1]. Moreover, US insurance companies have lost nearly $4.2 billion in profit and are expected to see a further 0.5-1% decrease if no steps are taken to implement digital tools. The author suggests that digitalization can even double existing profits in 5 years.

Figure 1.

A review of various articles in the field of traffic behavior did not reveal any work approaching the problem of a detailed analysis of individual driving patterns using kinematic trackers on cars. However, the available literature covers adjacent topics that provide a better understanding of the whole problem and broaden the outlook.

Vehicle operating speed analysis is very important for transport policies, engineering regulations, safety instructions and road infrastructure (Bhowmik, Yasmin and Eluru, 2018, [1]). According to existing studies, vehicle speed is one of the major factors influencing crash occurrence. The data were obtained for eight main arterials in the state of Florida, including 268 segments, for the year 2016. Moreover, there was additional information on variables such as roadway characteristics, traffic, land use, environment and unobserved effects. This comprehensive approach enabled an unusually high prediction score for speed. The model also included an elasticity analysis to find the factors that most affect the speed profile: the percentage change in the main variable (vehicular speed) following a 10% increase in an explanatory variable. It was discovered that road length mostly affects higher speed, while intersection density and the size of the industrial area reduce the probability of speeds in the >20 to 25 mph range. Moreover, it was found that the parameters responsible for speed differ across roads. That is why the authors decided to model the proportions of vehicle speed categories with a panel mixed generalized ordered probit fractional split (PMGOPFS) model. The proposed model is estimated using a quasi-log-likelihood based objective approach, as the maximum likelihood method is not appropriate for fractional outcomes.

Furthermore, a comparison between the traditional OPFS and GOPFS models using a log-likelihood ratio test showed that the PMGOPFS model is superior.

However, the research is based on data provided by the Regional Integrated Transportation Information System, which contains only recordings aggregated over two-hour intervals. For the micro-analysis of driver behavior this information is not very informative, as more detailed data is required - at a resolution of 0-10 minutes.

Recently, Klauer et al. [25] examined the influence of inattention on the probability of near-crashes and crashes using data from 100 GPS-equipped vehicles. Nevertheless, the authors analyzed such distractions as weariness, drowsiness, eye glances and other actions (phone calls or eating) and did not consider quantitative driving parameters: mileage, average speed and accelerations.

Wahlberg (2004, [24]) showed that there is a linkage between the individual acceleration patterns of bus drivers and crash involvement. It was also found that acceleration plays a more important role than speed when predicting crashes. In some studies the weather factor is mentioned as well, as it may be responsible for speed violations (Ahmed et al., 2012) [30]. It was discovered that during the dry season and at low speeds accidents are more probable than usual.

Another work, by Jun, Guensler and Ogle (2007, [2]), examines whether there are significant differences between crash-involved and crash-not-involved drivers using in-vehicle monitoring technology. This research shows that drivers who had accidents appear to drive at a higher speed than those who were not involved in crashes, except for freeway traveling during morning rush hours. Moreover, the study suggests that it is possible to identify risky drivers using a data source similar to a GPS tracker. So, it seems that drivers with `accident experience' tend to buy insurance and drive riskier than drivers without insurance.

This fact was first disclosed in the case of medical insurance by Kenneth J. Arrow (1963) [20], and in 1968 Mark V. Pauly gave an additional comment on that work by depicting the moral hazard problem, stressing the idea that people who have insurance are the ones who tend to claim excess medical care. In 1986 he enriched his concept by adding an argument about the ineffective system of tax deductibility, highlighting that clients who receive large subsidies can actually afford insurance even without them [21], while those who receive low subsidies are in need of extra encouragement. This effect can also be applied to the auto insurance industry (Vickrey, 1968) [26] and (Cummins, 1996) [27].

Recently, Edlin and Mandic (2006, [31]) claimed that it would be better if the insurance industry utilized pricing schemes as a tool to decrease the number of accidents and perhaps imposed taxes on insurance premiums or quoted them per mile instead of per car per year [32].

A big contribution to this study was made by Lars Hultkrantz, Jan-Eric Nilsson and Sara Arvidsson in their paper (2012, [3]), where they investigated tools which encourage drivers to comprehend all the consequences of their driving style (average and maximum speed, etc.) using Pay-As-You-Speed (PAYS) insurance, a special case of Pay-As-You-Drive (PAYD) insurance, which can deal with the adverse selection problem and guarantee perfect price discrimination relying on real-time driving patterns (Panos Desyllas and Mari Sako, 2013) [23]. The main aim of that research was to use a Pigovian taxation scheme and PAYS to detect the incentives which affect the way people drive and the reasons why they shift to other means of transport. Importantly, PAYS insurance proved to be more efficient at targeting risk classes than PAYD and taxation, as it computes the insurance premium according to whether clients conform to speed limits or not. Moreover, the authors suggest implementing this tool as a key instrument in the government's traffic safety policy for several reasons. First of all, individual contracts mitigate the majority of the problems arising in a fiscal setting, as they may offer a variety of programs and tariffs, allowing alternative decisions for those people who do not want to be monitored. Secondly, insurance firms are able to charge drivers of different riskiness differently according to their previous `speed history'. Thirdly, insurance companies are more agile in dealing with good and bad behavior. The main disadvantage of the paper lies in its data, where speeding is given as a zero-one binary variable, whereas the actual costs depend on the actual speed.

In Sweden, Hultkrantz and Lindberg (2011, [28]) conducted an experiment using PAYS insurance. They noticed that participants decreased heavy speeding during the first month; however, during the second month only those drivers who received penalties altered their driving pattern. Practically the same experiment was held in the Netherlands by Bolderdijk et al. (2011, [29]) in cooperation with 5 Dutch insurance companies. According to the rules, young drivers could pay less for their monthly insurance if they did not exceed the speed limit. As a result, speed violations dropped by 14%.

This study will address the described shortcomings and take a few steps forward towards solving the accident problem.

Methodology

The main idea of this study is to combine various econometric and machine learning techniques to evaluate and interpret the results. These include clustering analysis, logistic regression, random forest, gradient boosting, XGBoost, LightGBM and CatBoost.

Clustering analysis is represented by the k-means method, as it is the simplest and best-known algorithm for unsupervised data classification, when there are only input vectors without labels. To find similar patterns the algorithm looks for a given number (K) of centroids (clusters) in a dataset. Let X = {x_i}, i = 1, ..., n be the set of d-dimensional points that should be clustered into K clusters, C = {c_k, k = 1, ..., K}. The k-means method identifies the partition for which the squared error between the empirical mean of a cluster and its points reaches the minimum. Let \mu_k be the mean of cluster c_k. Therefore, the squared error between \mu_k and the points in cluster c_k is defined as

J(c_k) = \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2 .

The main idea behind this algorithm is to minimize the sum of squared errors over all K clusters:

J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2 .

The k-means algorithm is greedy and may therefore converge only to a local minimum; however, when the clusters are well separated it can converge to the global optimum. A visualization of the clustering method is presented in Figure 2 [4].

Figure 2.
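As an illustration, a minimal sketch of this clustering step in Python with scikit-learn is given below. The file name drivers.csv and the selected columns are placeholders for the per-driver features of Table 1, not the exact setup of this study.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical input: one row per driver with aggregated features as in Table 1.
    drivers = pd.read_csv("drivers.csv")
    features = ["mileage", "avg_speed", "max_speed", "acc1_100", "drg1_100"]

    # Standardize so that no single feature dominates the squared-error criterion.
    X = StandardScaler().fit_transform(drivers[features])

    # Fit k-means with K clusters; labels_ assigns each driver to a cluster.
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
    drivers["cluster"] = kmeans.labels_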

Logistic Regression is usually implemented when the target is categorical (`Yes' or `No') or, in our case, binary (1 or 0). It can be easily interpreted, as the output of the hypothesis is the estimated probability. In this research the target variable is the accident indicator. Logistic Regression is universal; however, it is practically impossible to get insights from sophisticated datasets with mutually dependent variables [13]. That is why it shows poor performance compared to advanced machine learning techniques, for example Random Forest, Gradient Boosting Machines and their analogues, which will be introduced in this study as well. In general, Logistic Regression can be described by the following formula:

P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}} ,

where z is a linear combination of the features. To fit the model and obtain the probability, the following equations should be solved:

z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k ,

\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{n} \bigl[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \bigr] ,

i.e. the coefficients are chosen to maximize the log-likelihood of the observed outcomes.

The method has some significant advantages. First of all, the algorithm can be implemented in a very simple way. Secondly, the variance of its predictions is low. Thirdly, Logistic Regression can be applied to feature extraction. Moreover, the models are easily updated using stochastic gradient descent.

In spite of such convincing arguments, the method is not ideal, as it cannot cope well with huge numbers of categorical variables and is not flexible enough to detect complex relationships [5].
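A hedged sketch of such a logistic regression baseline on the crash target is shown below; the file and column names are assumptions made for illustration rather than the study's actual code.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    data = pd.read_csv("drivers.csv")            # hypothetical file with the Table 1 variables
    X = data.drop(columns=["crash"])
    y = data["crash"]                            # binary target: 1 if an accident occurred

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    # Fit the model and check the quality of the estimated crash probability on hold-out data.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print("ROC-AUC:", roc_auc_score(y_test, proba))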

Random Forest is a supervised machine learning algorithm which builds a forest of decision trees. In many cases the `bagging' method is utilized to train the ensemble; it combines different learning models to improve the result (see Figure 3).

Figure 3.

The main advantage of the algorithm is the ability to add random effects to the model while growing the trees. It speeds up the search for the most important variable by taking a random subset of variables when splitting a node instead of considering all of them.

Furthermore, within Random Forest it is possible to obtain feature importances according to the features' contribution to the decrease in impurity across the forest [6].

However, to make correct predictions the major hyperparameters should be tuned (a tuning sketch follows the list below):

· n_estimators, which fixes the number of trees in the ensemble.

· max_features, the maximum number of features considered when splitting a node.

· min_samples_leaf, the minimum number of samples required in a leaf node.
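As mentioned above, a sketch of tuning these hyperparameters with a simple grid search follows; the grids are illustrative values, not those used in the study, and X_train, y_train are the training data from the earlier logistic regression sketch.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 300, 500],      # number of trees in the forest
        "max_features": ["sqrt", "log2"],     # features considered when splitting a node
        "min_samples_leaf": [1, 5, 10],       # minimum number of samples in a leaf
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="roc_auc",
        cv=5,
    )
    search.fit(X_train, y_train)
    print(search.best_params_)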

Gradient Boosting Machines (GBM) [7] is a machine learning tool for regression and classification problems which creates a prediction model in the form of an ensemble of weak prediction models - often decision trees. The aim of the algorithm is to find the model that minimizes a loss function:

\hat{F} = \arg\min_{F} \, \mathbb{E}_{x,y} \bigl[ L(y, F(x)) \bigr] .

The first step of the method can be described by the following formula:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) ,

where L is the loss function and \gamma is a constant parameter. The second step requires the computation of pseudo-residuals r_{im}:

r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} \quad \text{for } i = 1, \dots, n \text{ and } m = 1, \dots, M .

Thirdly, a tree h_m(x) is fitted to the pseudo-residuals. However, the updated parameter \gamma_m is needed to complete the model; \gamma_m can be computed using the following logic:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl( y_i, F_{m-1}(x_i) + \gamma \, h_m(x_i) \bigr) .

After that the model can be updated:

F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) .

And finally we obtain the resulting model F_M(x).

The method can be visualized as a sum of decision trees, as shown in Figure 4 [11].

Figure 4.

However, to prevent overfitting, Gradient Boosting requires tuning of its initial parameters to produce a better prediction model (a sketch follows the list below):

· Size of the trees, which controls the maximum depth of interactions between variables.

· Learning rate, which shrinks the contribution of each tree.
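A minimal scikit-learn sketch exposing exactly these two parameters is given below; the values are illustrative and the training data are those from the earlier sketches.

    from sklearn.ensemble import GradientBoostingClassifier

    gbm = GradientBoostingClassifier(
        n_estimators=200,     # number of boosting stages M
        max_depth=3,          # size of each tree, limiting interactions between variables
        learning_rate=0.05,   # shrinks the contribution of each tree
        random_state=0,
    )
    gbm.fit(X_train, y_train)
    print("Accuracy:", gbm.score(X_test, y_test))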

In order to deal with the existing problems, other algorithms were used in the research: XGBoost and LightGBM. Being refined forms of Gradient Boosting, these methods have proved to be quite successful in preventing overfitting and are less time-consuming.

XGBoost is a very popular tool for winning machine learning competitions. It has a lot in common with gradient boosting, but some new features enable it to win the race. First of all, it has an advanced system of tree penalization. Secondly, within this method leaf nodes can be shrunk proportionally according to their evidence. Moreover, it introduces an extra randomization parameter, which can decrease the correlation between the trees and thus lead to a better prediction model. Furthermore, XGBoost utilizes a Newton-Raphson style function approximation, which allows it to move more directly towards the function's minimum [8].

LightGBM is a gradient boosting framework that utilizes tree-based algorithms [9]. The method grows decision trees leaf-wise (best-first), while most other implementations grow them level-wise. It selects the leaf with the maximum delta loss to grow, which enables it to reduce the loss more than level-wise algorithms. This is illustrated in Figures 5-6.

Figures 5-6. Leaf-wise tree growth in LightGBM; level-wise tree growth in XGBoost.

Moreover, it is known for being faster and consuming far less memory than its analogues. Notably, this method deals well with categorical data by taking the names of the particular columns without one-hot encoding. The algorithm also uses the new technology of Gradient-based One-Side Sampling (GOSS) and a histogram-based search for split values: the data is divided into discrete bins to find the split value, which is faster than enumerating all possible split points over pre-sorted feature values.

CatBoost (Categorical Boosting) is an open-source library for gradient boosting on decision trees created by Yandex for a wide range of purposes: ranking, forecasting and developing recommendations. It does not require pre-processing of categorical data into integers and thus can guarantee higher accuracy scores. Furthermore, the method combats overfitting better, as it is the successor of the MatrixNet algorithm [10].
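Since all three libraries expose a very similar scikit-learn-style interface, a hedged comparison sketch is given below; the parameter values are illustrative defaults, not the tuned settings of this study.

    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    from catboost import CatBoostClassifier
    from sklearn.metrics import roc_auc_score

    models = {
        "XGBoost": XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4),
        "LightGBM": LGBMClassifier(n_estimators=300, learning_rate=0.05, num_leaves=31),
        "CatBoost": CatBoostClassifier(iterations=300, learning_rate=0.05, depth=4, verbose=0),
    }

    # X_train, y_train, X_test, y_test as in the earlier sketches.
    for name, model in models.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(name, round(auc, 3))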

2. Data description

The idea of the research is to find significant behavioral characteristics of drivers, which will make it possible to enhance crash prediction models and can then be used to issue real-time notifications that decrease the accident rate.

That is why the initial data needs a lot of preprocessing work. First of all, accurate data sources are necessary for the identification of the driving patterns of individuals. For this purpose data from an insurance company will be used. There are 2 million rows of training observations and several features: latitude, longitude, time and speed (several levels, including accelerations and turns). The variables are divided into two groups. The first is represented by the raw data gathered by trackers installed in the cars. These features are shown in Table 1:

Table 1. List of variables

Feature | Description
mileage | Total mileage
speed3_100 | Frequency of speed violations at 40-60 km/h
acc1_100 | Frequency of type I accelerations
acc2_100 | Frequency of type II accelerations
acc3_100 | Frequency of type III accelerations
drg1_100 | Frequency of type I braking
drg2_100 | Frequency of type II braking
drg3_100 | Frequency of type III braking
side1_100 | Frequency of type I side accelerations
side2_100 | Frequency of type II side accelerations
side3_100 | Frequency of type III side accelerations
avg_daily_business_mileage | Mean mileage on business days
avg_daily_morning_jam_mileage | Mean mileage in morning rush hour
avg_daily_night_mileage | Mean mileage at night
avg_speed | Average speed
max_morning_jam_speed | Max speed in morning rush hour
max_evening_jam_speed | Max speed in evening rush hour
max_night_speed | Max speed at night
max_speed | Max speed
crash | 1 if there was an accident, 0 otherwise

However, to fulfill the purpose of this study, data from Open Street Map was parsed, including information about all traffic signals and their positions (longitude, latitude). Moreover, to match this dataset with the drivers, more detailed information was used (see Table 2).

Table 2. List of variables

Feature | Description
Longitude | First part of the driver's coordinates
Latitude | Second part of the driver's coordinates
Date | Date of a ride
Time | Hour-minute-second of a ride
min_x | Minimum value of acceleration/braking per minute
max_x | Maximum value of acceleration/braking per minute
min_y | Minimum value of pressure during left/right turns per minute
max_y | Maximum value of pressure during left/right turns per minute
min_z | Minimum value of downward pressure
max_z | Maximum value of downward pressure

Also some modified variables were computed from the existing ones (see Table 3).

Table 3. List of variables

Feature | Description
min_x_more_mean | Mean value of smooth accelerations
max_x_more_mean | Mean value of extreme accelerations
min_y_more_mean | Mean value of minimum pressure in right turns
max_y_more_mean | Mean value of maximum pressure in right turns
min_z_more_mean | Mean value of minimum downward pressure
max_z_more_mean | Mean value of maximum downward pressure
min_x_less_mean | Mean value of extreme braking
max_x_less_mean | Mean value of smooth braking
min_y_less_mean | Mean value of minimum pressure in left turns
max_y_less_mean | Mean value of maximum pressure in left turns

These variables were created to expand the model with behavioral patterns, as they represent the way a particular driver reacts to traffic signals (the intensity of turns, accelerations and braking). Therefore these data will be used in the forecasting models.

The variable “crash” will be the target for the models, as it indicates whether an accident occurred or not.

K-means Clustering

Figure 7.

This method was applied to the initial dataset to divide drivers by their driving characteristics. This procedure will make it possible to enhance the quality of the prediction models in further steps. However, first of all we need to identify the number of clusters using an algorithm which computes the total within-cluster sum of squares for different numbers of clusters (tot.withinss). The lower tot.withinss is, the more compact and better separated the clusters are [12].

Despite the fact that there is no ideal tool to choose the exact number, it is still useful to look for the point where the slope of this curve changes its angle (the `elbow'). Following this approach it was decided to take 5 clusters (see Figure 7).
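A sketch of this elbow procedure, computing the total within-cluster sum of squares (the inertia_ attribute in scikit-learn) for a range of K, could look as follows; X stands for the standardized driver-feature matrix.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    inertias = []
    ks = range(1, 11)
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)          # total within-cluster sum of squares

    plt.plot(ks, inertias, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("Total within-cluster sum of squares")
    plt.show()                                # the 'elbow' of the curve suggests a reasonable K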

After splitting the dataset into clusters it is useful to see how accidents are distributed among them. This helps in understanding the structure of the clusters and the level of the `accident environment' inside each one (see Table 4).

Table 4

№ of Cluster | Sum of accidents | % from all accidents | Sum of observations | % accidents from cluster observations
Cluster 1 | 272 | 8.52% | 20413 | 1.33%
Cluster 2 | 47 | 1.47% | 4242 | 1.11%
Cluster 3 | 835 | 26.14% | 103072 | 0.81%
Cluster 4 | 908 | 28.43% | 78342 | 1.16%
Cluster 5 | 1132 | 35.44% | 137017 | 0.83%
All clusters | 3194 | 100% | 343086 | 0.93%

The table above shows that, despite having different numbers of observations, the clusters are well defined, as there are no abrupt fluctuations in the proportion of accidents within each group. It is important to examine these proportions, because they may point to the specific driving patterns which cause accidents. Each cluster has its own center and specific parameters grouped around it. Table 5 below contains information about the center of each cluster.

Table 5. Cluster centers

Feature | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
mileage | 381.3 | 411.8 | 366.7 | 883.1 | 263.5
speed3_100 | 0.02 | 0.05 | 0.02 | 0.03 | 0.01
acc1_100 | 21.9 | 22.6 | 19.3 | 11.8 | 19.5
acc2_100 | 2.9 | 3.7 | 2.3 | 1.6 | 2.5
acc3_100 | 0.3 | 0.6 | 0.3 | 0.2 | 0.3
drg1_100 | 6.5 | 8.1 | 6.7 | 5.3 | 7.1
drg2_100 | 0.9 | 1.4 | 0.9 | 0.8 | 0.9
drg3_100 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0
side1_100 | 6.7 | 7.3 | 6.5 | 5.1 | 6.8
side2_100 | 0.8 | 1.1 | 0.8 | 0.7 | 0.9
side3_100 | 0.2 | 0.3 | 0.2 | 0.3 | 0.2
avg_daily_business_mileage | 268.5 | 290.1 | 258.6 | 634.5 | 182.1
avg_daily_morning_jam_mileage | 33.5 | 33.1 | 32.3 | 75.1 | 23.8
avg_daily_night_mileage | 63.7 | 70.1 | 61.2 | 159.8 | 41.3
avg_speed | 24.9 | 26.5 | 25.4 | 36.2 | 23.7
max_evening_jam_speed | 62.2 | 75.9 | 63.6 | 89.0 | 58.0
max_morning_jam_speed | 73.4 | 79.7 | 69.1 | 99.8 | 63.2
max_night_speed | 78.0 | 86.5 | 74.0 | 105.1 | 67.2
max_speed | 117.0 | 129.3 | 114.1 | 139.6 | 110.0
insurance_sum (mln. Rub.) | 1.7 | 3.9 | 0.3 | 0.0 | 0.0

From Table 5 it can be seen that the clusters are well defined by their frequency of type I accelerations, mileage, average daily business mileage and max speed. Notably, these parameters will also be among the most important ones according to their influence on the probability of crashes.

When dealing with clusters, visualization can update our knowledge about the data. For instance, the picture below shows the clusters' distribution over two parameters: maximum speed in the morning rush hours and maximum speed in the evening rush hours. The picture (see Figure 8) shows that there is a group of people who systematically drive very fast in the evening, which may lead to a higher probability of getting into an accident.

Figure 8

So, clusters can actually improve the understanding of people's behavior, both their rational and irrational decisions. Moreover, cluster analysis can help insurance companies, as it reduces dimensionality and thus increases the speed and effectiveness of the prediction algorithms.

After this iteration new dummy variables were included in the dataset: cluster_1, cluster_2, cluster_3, cluster_4, cluster_5. Furthermore, the dataset was run through a Logit model and the p-values of the coefficients confirmed the significance of the new cluster variables (see Appendix).

3. Behavioral Variables

The main purpose of this paper is to add behavioral characteristics to the initial dataset and determine how they influence the prediction score.

The first step of this research is data preprocessing. Moscow data is parsed from Open Street Map, so there is detailed information about every object in the area (latitude, longitude and description). For our topic, traffic data is required. Therefore, dummy variables are created to match the initial observations with objects: if a dummy equals '1', the driver rode past the object, which can be a traffic light, a speed bump, etc. This will help us to analyze two major things: firstly, to calculate the average speed of a particular driver near a specific traffic object; secondly, to include the speed of a particular driver before/after a traffic object as a supplementary variable (see Table 4).
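A hedged sketch of how such a dummy could be computed is shown below: a GPS point is flagged as being near a traffic object if the haversine distance to any parsed object falls below a threshold. The 50-metre radius and the example coordinates are assumptions made purely for illustration.

    import numpy as np

    def haversine_m(lat1, lon1, lat2, lon2):
        # Great-circle distance in metres between points given in degrees.
        r = 6371000.0
        p1, p2 = np.radians(lat1), np.radians(lat2)
        dp, dl = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
        a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
        return 2 * r * np.arcsin(np.sqrt(a))

    def near_object(lat, lon, objects_lat, objects_lon, radius=50.0):
        # 1 if the point lies within `radius` metres of any traffic object, else 0.
        d = haversine_m(lat, lon, objects_lat, objects_lon)
        return int((d <= radius).any())

    # Example: one track point against an array of traffic-light coordinates.
    lights_lat = np.array([55.7512, 55.7601])
    lights_lon = np.array([37.6184, 37.6205])
    print(near_object(55.7513, 37.6185, lights_lat, lights_lon))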

The main hypothesis is that risky drivers very rarely slow down when they see traffic lights: they stop almost in front of them and, after a pause, speed up very fast.

The opposite situation should be observed for non-risky drivers: they are likely to be calmer when approaching traffic lights or speed bumps. Moreover, schools and shopping centers should influence driving style, so the coordinates of such large objects will be added. The described characteristics can alter the scoring process of insurance companies.

As mentioned above, this research requires a lot of preprocessing steps, which have been done on a training dataset (2 million rows) and will then be extrapolated to the entire test data. This step is time-consuming, which is why a train-test split was made.

Figure 9

First of all, timestamp data was replaced by corresponding dummy variables.

Secondly, the data from Open Street Map was converted into the appropriate format using parsing tools via Python.
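One possible way to extract traffic-signal coordinates from a raw OSM XML extract with the Python standard library is sketched below; the file name and the restriction to the highway=traffic_signals tag are assumptions, since the exact parsing pipeline is not described in detail.

    import xml.etree.ElementTree as ET

    def parse_traffic_signals(osm_file):
        # Return (lat, lon) pairs of nodes tagged highway=traffic_signals in an OSM XML file.
        signals = []
        tree = ET.parse(osm_file)
        for node in tree.getroot().iter("node"):
            tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
            if tags.get("highway") == "traffic_signals":
                signals.append((float(node.get("lat")), float(node.get("lon"))))
        return signals

    # Usage on a hypothetical extract of Moscow:
    # signals = parse_traffic_signals("moscow.osm")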

The existing information was visualized in QGIS 3 for a better understanding of the data.

The picture below (see Figure 9) shows the `driving history' of one individual over three months.

In the test dataset there are fifteen unique drivers, and their accumulated driving history is presented in Figure 10.

Figure 10

To complete the task, data about traffic objects was added. In the picture below (see Figure 11) they are represented by red stars.

Figure 11

At the micro level it is important how drivers react to traffic lights when passing by. Luckily, there is plenty of data to analyze (see Figure 12).

Figure 12

As a result it was possible to calculate new variables for each driver, describing the driver's mean intensity when making left and right turns, braking and speeding up.

In the next step they were added to the initial dataset, matched to the driver they belong to.

4. Important Variables

When it comes to finding the most important variables, it is better to try various approaches, as there is no definite solution. Besides, it is obvious that a reduced dimension will speed up all the algorithms, as the most important variables will be enough to describe the variance of the regressors. Principal Component Analysis (PCA) is a good illustration of this idea. The method is a technique that transforms a large dataset into a smaller one without losing much information. However, if some features are dropped, the accuracy of the prediction may decrease. That is why there is a tradeoff between speed and quality, and it is better to check the prediction score on the old and the new dataset.

From Figure 13 it can be seen that 15 features explain at least 90% of the variance.

Figure 13

PCA analysis is very helpful, but it does not indicate which variables exactly should be eliminated.
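A sketch of the explained-variance check behind Figure 13 could look as follows; X denotes the driver-feature matrix and the 90% threshold matches the statement above.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X_std = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_std)

    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Smallest number of components explaining at least 90% of the variance.
    n_components = int(np.argmax(cumulative >= 0.90)) + 1
    print(n_components, cumulative[n_components - 1])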

In this paper four types of algorithms will be tested, and their results will be analyzed to identify redundant features.

The first is the Boruta algorithm. It was utilized to derive the essential features from the initial 33. The algorithm determines them over 100 iterations through the dataset. The model resolves the tradeoff between speed and accuracy, as it does not have any negative effect on the prediction score and thus prevents over-fitting. As a result only the main variables remain [14].

The Boruta algorithm is a wrapper method built on top of the random forest classification algorithm. Random forest estimates the importance of a feature across all trees by the following formula (the standard permutation importance):

VI = \frac{\sum_{i \in OOB} I(y_i = \hat{y}_i) - \sum_{i \in OOB} I(y_i = \hat{y}_i^{\pi})}{|OOB|} ,

where \hat{y}_i is the class predicted on the original out-of-bag observations and \hat{y}_i^{\pi} is the class predicted after the feature's values have been randomly rearranged (permuted). After each iteration VI and the parameter Z are defined for every feature using the rule below:

Z = \frac{\overline{VI}}{\sigma_{VI}} ,

i.e. the mean importance over the trees divided by its standard deviation.

In the next step the maximum Z parameter among the shadow variables (MSZA) is taken and compared with the initial Z (i-Z) parameters of the real features:

· If i-Z > MSZA, the feature is accepted; if it is clearly below MSZA, it is rejected.

· If i-Z is close to MSZA and the decision is inconclusive, a new iteration is run, until all variables are either accepted or rejected.

The method makes it possible to get an unbiased and stable selection of essential and unessential variables in the dataset, as it uses a cross-validation-like procedure. The algorithm iterates through a great number of random forest runs to obtain a statistically significant split between significant and irrelevant variables. Notably, Boruta adds some randomness and gathers results from an ensemble of randomized samples, and thus provides a better understanding of which features are really important.
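A hedged sketch of running Boruta in Python via the boruta package (BorutaPy wrapped around a random forest) is given below; the 100 iterations mirror the setting described above, while the other parameters are illustrative.

    import numpy as np
    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
    boruta = BorutaPy(rf, n_estimators="auto", max_iter=100, random_state=0)

    # BorutaPy expects numpy arrays; X, y are the feature matrix and the crash target.
    boruta.fit(np.asarray(X), np.asarray(y))

    selected = X.columns[boruta.support_]     # features confirmed as important
    print(list(selected))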

According to Boruta, only 9 features remain important after 100 iterations (see Table 6).

Table 6

Importance | Feature | Rank
0 | max_evening_jam_speed | 1
1 | max_morning_jam_speed | 1
2 | drg2_100 | 1
3 | drg1_100 | 1
4 | acc2_100 | 1
5 | acc1_100 | 1
6 | avg_daily_business_mileage | 1
7 | avg_daily_morning_jam_mileage | 1
8 | min_z_more_mean | 1

The second is Recursive Feature Elimination (RFE) [15], which represents the idea of building a model (for instance, SVM or a regression model), selecting the best or the worst variable (in the majority of cases this relies on the coefficients), putting that variable aside and repeating the process with the rest of the variables. This procedure is run until every variable in the dataset has been used up. The next step is to rank the variables based on the time of their elimination. The algorithm is known for being greedy when choosing the most important features. As a result the algorithm gave us 15 essential variables. The output is sorted from the most important to the least (see Table 7).

Table 7

Rank | Feature
1 | acc1_100
1 | acc2_100
1 | acc3_100
1 | avg_daily_night_mileage
1 | drg1_100
1 | drg2_100
1 | drg3_100
1 | max_evening_jam_speed
1 | max_morning_jam_speed
1 | max_x_less_mean
1 | min_x_less_mean
1 | side1_100
1 | side2_100
1 | side3_100
1 | speed3_100
2 | min_y_more_mean
3 | min_z_more_mean
4 | max_speed
5 | max_z_more_mean
6 | avg_speed
7 | max_y_less_mean
8 | min_x_more_mean
9 | max_x_more_mean
10 | avg_daily_morning_jam_mileage
11 | avg_daily_business_mileage
12 | max_y_more_mean
13 | min_y_less_mean
14 | max_night_speed
15 | mileage

The effectiveness of the method depends on the kind of model used for variable ranking at each iteration. That is why it was decided to use different models and compare them.
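A sketch of running RFE with two different base estimators, as suggested above, is given below; the choice of estimators is illustrative.

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # X, y: the feature matrix (as a DataFrame) and the crash target.
    for base in (LogisticRegression(max_iter=1000),
                 RandomForestClassifier(n_estimators=200, random_state=0)):
        rfe = RFE(estimator=base, n_features_to_select=15)   # keep 15 variables, as in Table 7
        rfe.fit(X, y)
        ranking = sorted(zip(rfe.ranking_, X.columns))       # rank 1 = selected feature
        print(type(base).__name__, ranking[:5])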

The third method uses Logistic Regression as a base model and L1-regularization for better accuracy. The L1 penalty corresponds to the following formulation of LASSO regression:

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j} x_{ij} \beta_j \Bigr)^2 \quad \text{subject to} \quad \sum_{j} |\beta_j| \le t ,

for some t > 0.

Figure 14

This sort of regularization can result in zero coefficients, which means that some variables can be dropped from the calculation of the output, thus providing feature selection. However, the parameter alpha needs to be tuned, which is why it was decided to test three variants: alpha = {1, 0.05, 0.01}. The results are shown below (see Figure 14 and Table 8).

Table 8

Alpha | Training score | Testing score | № of features
1 | 0 | -1.39 | 0
0.05 | 0.025 | 0.005 | 21
0.01 | 0.032 | 0.003 | 28

According to these results it was decided to choose alpha = 0.05.
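A sketch of this L1-based selection with scikit-learn is shown below; treating the crash indicator as the response of a plain LASSO regression is one possible reading of the alpha values in Table 8, so the snippet is illustrative rather than the study's exact procedure.

    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    # X, y: the feature matrix (as a DataFrame) and the crash target.
    X_std = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=0.05).fit(X_std, y)

    # Variables whose coefficients were shrunk exactly to zero are dropped.
    kept = [name for name, coef in zip(X.columns, lasso.coef_) if coef != 0]
    print(len(kept), "variables kept:", kept)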

As a result, LASSO provides a list of the most important variables sorted by their rank (Table 9).

Table 9

Rank | Feature
1 | speed3_100
2 | side3_100
3 | side2_100
4 | side1_100
5 | min_z_more_mean
6 | min_y_more_mean
7 | min_y_less_mean
8 | min_x_more_mean
9 | min_x_less_mean
10 | mileage
11 | max_z_more_mean
12 | max_y_more_mean
13 | max_y_less_mean
14 | max_x_more_mean
15 | max_x_less_mean
16 | max_speed
17 | max_night_speed
18 | max_morning_jam_speed
19 | max_evening_jam_speed
20 | drg3_100
21 | drg2_100

The last method is the Extra-Trees classifier (extremely randomized trees). In contrast to random forests, the algorithm abandons the use of bootstrap versions of the learning sample. Moreover, instead of trying to find the best cut-point for each of the K randomly selected variables at each node, the algorithm selects cut-points at random. Notably, not using bootstrap improves the model in terms of bias, while the randomly selected cut-points lead to a good reduction in variance.

The output of the model is the proportion of importance attributed to each variable (Figure 15).

Figure 15

From the bar chart it can be concluded that only 14 features are crucial for the analysis.
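A sketch of deriving the feature-importance proportions behind Figure 15 with scikit-learn's ExtraTreesClassifier follows; the number of trees is an illustrative value.

    from sklearn.ensemble import ExtraTreesClassifier

    # X, y: the feature matrix (as a DataFrame) and the crash target.
    et = ExtraTreesClassifier(n_estimators=300, random_state=0).fit(X, y)

    importances = sorted(zip(et.feature_importances_, X.columns), reverse=True)
    for value, name in importances[:14]:      # the 14 most important variables
        print(f"{name}: {value:.3f}")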

Despite the fact that the models give different results in terms of the number of variables, the behavioral ones are still present in each of them. This is an essential point of this research, as the behavioral patterns of individuals have never been examined so closely. That is why further analysis of accident probabilities makes sense.

To proceed with the results it is necessary to identify the key variables. In this research Boruta and the Extra-Trees classification will be given more weight, as they represent deeper learning of the data. That is why a new set of variables will be formed according to the top rankings of these two algorithms. First of all, it will include every behavioral feature, as they contain exogenous information about the drivers' style. Secondly, it will contain acc1_100, side2_100, max_evening_jam_speed, max_morning_jam_speed, drg2_100, drg1_100, acc2_100, avg_daily_business_mileage, avg_daily_morning_jam_mileage and max_speed. As a result, 21 main variables out of 29 were chosen.

Moreover, it will be tested whether modified variables affect the prediction score (standard deviations of the variables and their mean values over 5 weeks).

Models comparison

Firstly, a model with all variables will be described.

Logistic Regression

Accuracy = 0.98

ROC-AUC = 0.87

Classification report:

Table 10

 | Precision | Recall | F1-score | Support
0 | 0.99 | 1.00 | 0.99 | 874
1 | 0.00 | 0.00 | 0.00 | 9
Micro avg | 0.99 | 0.99 | 0.99 | 883
Macro avg | 0.49 | 0.50 | 0.50 | 883
Weighted avg | 0.98 | 0.99 | 0.98 | 883

Despite the fact that the model has high Accuracy and ROC-AUC scores, it still fails to detect crashes, as the recall for class “1” is 0. It means the model predicts mostly “0”, which contradicts the idea of the research.

Figures 16-17

The results are also confirmed by the confusion matrix (see Figures 16-17). The algorithm did not predict any `1' correctly, while misclassifying one `0' as `1'.

However, Logit can be used to get some essential insights about the features. The results of the model are presented below. For more information about the coefficients, their standard deviations and p-values, see the Appendix.

Table 11-a

 | Full model with clusters | Shortened model with clusters | Shortened model without clusters
Model: | Logit | Logit | Logit
Dependent Variable: | crash | crash | crash
Date: | 2019-04-17 17:02 | 2019-05-08 17:02 | 2019-04-17 17:02
No. Observations: | 343086 | 3531 | 3531
Df Model: | 32 | 32 | 28
Df Residuals: | 343053 | 3498 | 3501
Converged: | 1.0000 | 1.0000 | 1.0000
No. Iterations: | 10.0000 | 11.0000 | 11.0000
Pseudo R-squared: | -0.029 | 0.221 | 0.169
AIC: | 37338.6222 | 405.0042 | 416.4057
BIC: | 37693.22315 | 608.5923 | 601.4858
Log-Likelihood: | -18636 | -169.50 | -178.20
LL-Null: | -186116 | -214.51 | -214.51
Scale: | 1.0000 | 1.0000 | 1.0000

Table 11-b

Pseudo R-squared is equal to 0.169; however, a model is usually considered “good” with a Pseudo R-squared in the range 0.2-0.4. Nevertheless, the Logit model can provide information about the variables. A few of them have a relatively low p-value: min_x_less_mean, acc1_100, side2_100, max_morning_jam_speed, max_evening_jam_speed. Moreover, the variables have different signs of their coefficients. min_x_less_mean has a negative sign, which means that the harder drivers brake, the lower the probability of an accident. The rest of the variables have a positive sign, which means that their increase can magnify the probability of accidents. This is quite intuitive: the more drivers speed up, the rougher they make turns, and the more aggressively they ride during morning and evening rush hours, the higher the frequency of accidents.

To sum up, these features altogether mean that even if a particular driver uses his/her car only for basic purposes (not too often) but acts carelessly when driving very fast, he/she is at risk.

XGBoost

Accuracy = 0.99

ROC-AUC = 0.90

Classification report:

Table 12

 | Precision | Recall | F1-score | Support
0 | 1.00 | 1.00 | 1.00 | 874
1 | 0.71 | 0.56 | ... | ...
