Cheng Qian, Zhang Hongyi
(College of Chemistry and Environmental Science, Hebei University, Baoding 071002, China)
Abstract: A 3-6-1 BPNN (back-propagation neural network) was used to relate molecular structures to boiling points. The data set was composed of 73 saturated aliphatic aldehydes and alkanones, with experimental boiling points ranging from 253.7 K to 631.2 K and carbon atom numbers ranging from 1 to 20. Each compound was characterized by three parameters obtained directly from its molecular structure, namely the functional group position index (P), the carbon atom number (N) and the methyl number (Nm), and the three parameters together were used as inputs to the BPNN constructed in this paper. There was a good linear relationship between the predicted and experimental boiling points, with a correlation coefficient of 0.9992, and the root mean square (RMS) error of the predicted boiling points was within 2.9 K. A comparison with multiple linear regression of the experimental boiling points against P, N and Nm showed that the boiling points predicted by the BPNN were superior to those predicted by multiple linear regression, which gave an RMS error of 11.6 K. Compared with the topological index methods reported previously, the proposed 3-6-1 BPNN needs no additional knowledge or software package to calculate complicated descriptors, and it uses only 3 input parameters, the smallest number among the corresponding reports.
One of the most important purposes of applying mathematical, statistical and computer-based methods in chemistry is to gain the maximum information about the properties of selected compounds by analyzing chemical data. As a result, interest in quantitative structure-property relationship (QSPR) studies has increased substantially in recent years.
Although the topological index, an important QSPR tool, has been used successfully to predict many physicochemical properties of organic compounds, including boiling points, reports on the prediction of boiling points for saturated aliphatic aldehydes and alkanones are still few. The reason is that the oxygen atom in these molecules makes the calculation of their topological indices more complicated than for saturated aliphatic chain hydrocarbons. Balaban et al. predicted the boiling points of 200 carbonyl compounds, comprising 127 mono- and dialdehydes and -ketones and 73 esters, using five topological indices. Toropov and Toropova obtained one-variable models of the normal boiling points of carbonyl compounds by the nearest neighboring code (NNC). Toropov and co-workers used the simplified molecular input line entry system (SMILES) to model the normal boiling points of acyclic carbonyl compounds. Several studies on predicting the boiling points of saturated aliphatic aldehydes and alkanones from molecular (or atomic) topological indices have also been reported by Lin Zhihua and co-workers, Chen Yan, Zhang Xiuli and co-workers [6, 7], Feng Changjun, Yang Weihua, Wang Keqiang, and others. However, the common shortcoming of these reports is that the models need many topological indices obtained through many difficult calculation steps. Therefore, it is necessary to find a new way of predicting the boiling points of saturated aliphatic aldehydes and alkanones.
As a powerful chemometric technique, artificial neural networks (ANNs) have been used in QSPR studies [10, 11]. Among the numerous network architectures, the most popular type for QSPR studies is the multilayer feed-forward network with the back-propagation algorithm, usually called the back-propagation neural network (BPNN). As far as we know, no BPNN method for predicting the boiling points of saturated aliphatic aldehydes and alkanones has been reported. In this paper, a 3-6-1 BPNN was used to relate molecular structures to boiling points. The data set was composed of 73 saturated aliphatic aldehydes and alkanones, with experimental boiling points ranging from 253.7 K to 631.2 K and carbon atom numbers ranging from 1 to 20. Each compound was characterized by three parameters obtained directly from its molecular structure, namely the functional group position index (P), the carbon atom number (N) and the methyl number (Nm), and the three parameters together were used as inputs to the BPNN constructed in this paper.
The results obtained by the 3-6-1 BPNN were validated, tested and compared with the results obtained either in previous reports or by multiple linear regressions.
2 EXPERIMENTAL METHOD AND DATA
2.1 Theory of artificial neural network (ANN)
ANN models are formed by organizing a large number of simple processing elements (PEs), also called neuron nodes, into a sequence of layers and linking these layers with modifiable weighted interconnections. A schematic representation of the neuron node structure is shown in Fig. 1.
Fig. 1 An artificial neuron model. wni is the weight associated with the connection from node n to node i, xn is the output of node n, n is the number of inputs to node i, yi is the output of node i, and θi is a bias term or threshold value of node i responsible for accommodating nonzero offsets in the data.
The net input for a node i is given by:

$$\mathrm{net}_i = \sum_{j=1}^{n} w_{ji}\,x_j + \theta_i$$

where j represents nodes in the previous layer, $w_{ji}$ is the weight associated with the connection from node j to node i, $x_j$ is the output of node j, n is the number of outputs from the previous layer that feed node i, and $\theta_i$ is a bias term or threshold value of node i responsible for accommodating nonzero offsets in the data. The output of node i is determined by the transfer function and the net input of the node. It is given by:

$$y_i = f(\mathrm{net}_i)$$

where $y_i$ is the output of node i, and f( ) is the transfer function, which can be chosen freely.
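As an illustration of the net input and transfer function described above, the following minimal Python sketch computes the output of a single node. The function name `neuron_output`, the default `tanh` transfer function and the numerical values are assumptions made for this example; the paper does not state which transfer function was used.

```python
import math

def neuron_output(inputs, weights, bias, transfer=math.tanh):
    """Output of one node: transfer(sum_j w_j * x_j + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Net input: weighted sum of the incoming signals plus the bias term.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(net)

# Illustrative values only. An identity transfer function is passed in
# so that the returned value equals the net input itself.
y = neuron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2,
                  transfer=lambda v: v)
```

With the identity transfer function the call returns the net input, 0.5 - 0.5 + 0.3 + 0.2 = 0.5; passing a different `transfer` changes only the final step.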
The BPNN was used in our work. The structure of this kind of neural network is shown in Fig. 2. The first layer is the input layer, with one node for each variable or feature of the data. The last layer is the output layer, with one node for each variable to be investigated. Between them lie one or more hidden layers. A node in a hidden layer can receive data from any node of the preceding layer, process the data, and output a signal.
Fig. 2 A fully connected multilayer feedforward back-propagation network
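To make the layered structure concrete, the sketch below performs one forward pass through a fully connected network with 3 input nodes, 6 hidden nodes and 1 output node. It is only an illustration: the `tanh` hidden transfer function, the linear output node, the random initial weights and all function names are assumptions, not details taken from the paper.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights (one row per node) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def layer_forward(inputs, weights, biases, transfer):
    """Apply the transfer function to each node's net input."""
    return [transfer(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward_3_6_1(x, hidden, output):
    """Forward pass: 3 inputs -> 6 hidden nodes -> 1 output."""
    h = layer_forward(x, hidden[0], hidden[1], math.tanh)           # assumed tanh
    return layer_forward(h, output[0], output[1], lambda v: v)[0]   # assumed linear

rng = random.Random(0)
hidden = init_layer(3, 6, rng)
output = init_layer(6, 1, rng)
# Placeholder standardized inputs standing in for (N, P, Nm).
y = forward_3_6_1([0.1, -0.3, 0.5], hidden, output)
```

Training would then adjust the entries of `hidden` and `output`; this sketch covers only the forward direction of signal propagation.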
In the BPNN, processing follows the back-propagation learning method. The BPNN output is compared with its desired value in each iteration. The changes in the values of the weights can be obtained using the following equation:

$$\Delta W_{ij} = \eta\,\delta_i\,O_j + \alpha\,\Delta W_{ij}^{\mathrm{previous}}$$

where $\Delta W_{ij}$ is the change in the weight factor for each network node, $\Delta W_{ij}^{\mathrm{previous}}$ is the change made in the previous iteration, $\delta_i$ is the actual error of node i, and $O_j$ is the output of node j. The coefficients $\eta$ and $\alpha$ are the learning rate and the momentum factor, respectively.
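The weight change described above can be sketched in Python as follows. This assumes the standard delta rule with a momentum term, in which the new change is the learning rate times the node error times the incoming output, plus the momentum factor times the previous change; the function name and the numerical values are hypothetical.

```python
def weight_change(delta_i, output_j, previous_change, learning_rate, momentum):
    """Delta rule with momentum (assumed form of the update)."""
    gradient_term = learning_rate * delta_i * output_j
    momentum_term = momentum * previous_change
    return gradient_term + momentum_term

# Illustrative numbers only.
change = weight_change(delta_i=0.2, output_j=0.5, previous_change=0.01,
                       learning_rate=0.1, momentum=0.9)
```

Here the gradient term contributes 0.1 × 0.2 × 0.5 = 0.010 and the momentum term 0.9 × 0.01 = 0.009, so the weight changes by 0.019.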
The weights play an important role in both forward propagation and back-propagation, so a proper setting of these weight factors is essential. The process of adapting the weights to an optimum set of values is called training the neural network. Signals are propagated from the input layer through the hidden layer(s) to the output layer. If the difference between the desired output and the one obtained has not reached a minimum, the error signals propagate backward toward the input layer, and the weights are adjusted continually along the way. After a number of iterations the difference reaches a minimum, and training can be ended.
The main problem of the back-propagation algorithm is that it often requires a long training course. As a result, many modifications of the algorithm have been proposed. The Levenberg-Marquardt (L-M) algorithm is one of these modifications, and it is the learning rule used in our study.
All calculations were performed on a Founder Feiyue 6000A workstation with 256 MB DRAM memory. The operating system of the Founder PC is Windows XP home edition.
2.3 The data and neural network model
The data set of 73 saturated aliphatic aldehydes and alkanones was taken from the literature [6,14], and the relationship between their boiling points and their molecular structure information was studied by means of back-propagation neural networks. For each compound, the name and the corresponding experimental boiling point (Tb,exp) are given in Table 1 together with the three parameters used as inputs in this study, namely the number of carbon atoms (N), the carbonyl position index (P) and the number of methyl groups (Nm). The meaning of N and Nm is clear from their names. The carbonyl position index P is defined as the reciprocal of the sequence number of the carbonyl carbon in the IUPAC system. Taking 2-methyl butyraldehyde as an example, the first step is to number each carbon in the main chain according to the IUPAC system. The second step is to find the sequence number of the carbonyl carbon. For this molecule the sequence number of the carbonyl carbon is 1, so its P is 1. Similarly, if the sequence number of the carbonyl carbon were 2 or 3, the corresponding P would be 0.5 or 0.33, respectively. N, P and Nm were the inputs of the network, and Tb,exp was used as its target.
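The definition of the carbonyl position index can be expressed in a few lines of Python. The function names are hypothetical; the example values for methyl ethyl ketone (N = 4, P = 0.5, Nm = 2) are those listed for entry 7 in Table 1.

```python
def carbonyl_position_index(locant):
    """P is the reciprocal of the IUPAC number of the carbonyl carbon."""
    if locant < 1:
        raise ValueError("the carbonyl locant must be a positive integer")
    return 1.0 / locant

def descriptors(n_carbons, carbonyl_locant, n_methyl):
    """Return the three network inputs (N, P, Nm) for one compound."""
    return (n_carbons, carbonyl_position_index(carbonyl_locant), n_methyl)

# Methyl ethyl ketone: 4 carbons, carbonyl at C2, 2 methyl groups.
mek = descriptors(4, 2, 2)
```

For any aldehyde the carbonyl carbon is C1, so P is always 1; for ketones P falls to 0.5, 0.33 and so on as the carbonyl group moves along the chain.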
The data set of 73 saturated aliphatic aldehydes and alkanones was randomly divided into two sets: a training set (50 compounds) and a testing set (23 compounds). The compounds assigned to the training set and the testing set are listed in Table 2 by their serial numbers in Table 1.
Table 1 Data of the 73 saturated aliphatic aldehydes and alkanones, the input and the target of the network.
|No.||Compounds||Carbon atom numbers||Carbonyl position index||Methyl numbers||Tb,exp/K|
|7||methyl ethyl ketone||4||0.5||2||352.8|
|12||methyl n-propyl ketone||5||0.5||2||375.5|
|14||methyl isopropyl ketone||5||0.5||3||367.4|
|26||methyl isobutyl ketone||6||0.5||3||389.6|
All data were standardized by using the following equation:
where xi is the original value of the functional group position index (P), carbon atom number (N), methyl number (Nm) or boiling point, n is the number of data points, and ai is the result of the standardization.
Table 2 Compounds selected in training set and testing set
|Sets||Compound codes as listed in Table 1|
|Training set||56, 12, 35, 29, 44, 50, 33, 25, 64, 63, 20, 51, 66, 16, 46, 43, 13, 7, 59, 32, 45, 6, 71, 1, 60, 68, 2, 37, 17, 55, 19, 31, 48, 4, 52, 28, 54, 23, 38, 21, 18, 58, 73, 27, 47, 65, 41, 72, 57, 9.|
|Testing set||49, 10, 62, 22, 36, 34, 40, 26, 24, 42, 53, 67, 69, 15, 30, 61, 14, 8, 5, 70, 39, 3, 11|
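A random division of the 73 compounds into 50 training and 23 testing compounds, of the kind listed in Table 2, could be generated as sketched below. The function name and the seed are assumptions, and the sketch will not reproduce the specific assignment in Table 2, since the random number generator used by the authors is not stated.

```python
import random

def split_compounds(n_total=73, n_train=50, seed=None):
    """Shuffle serial numbers 1..n_total and cut into two disjoint sets."""
    rng = random.Random(seed)
    ids = list(range(1, n_total + 1))
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# The seed is arbitrary; it only makes the example repeatable.
train_ids, test_ids = split_compounds(seed=1)
```

Calling the function with different seeds gives different random groupings of the same 73 compounds.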
To predict the boiling points of saturated aliphatic aldehydes and alkanones, the following procedure was used. First, an appropriate number of hidden layer nodes was chosen. Second, the effects of the different learning algorithms used in the network were compared. Third, the prediction was carried out. Finally, the predicted results were evaluated, and the BPNN method was compared with other methods, namely the MLR method and the topological index methods previously reported.
3 RESULTS AND DISCUSSION
The neural network methodology has several empirically determined parameters, for example the number of hidden layer nodes, the number of training epochs or the convergence criterion, the learning rate and momentum term, and the initialization of the network. Once the inputs and target of the network have been fixed, the network has to be optimized.
Training is ended when the mean square error (MSE) values for the training and testing sets simultaneously reach a minimum. The MSE values for both sets are calculated throughout the training process, and by monitoring their trend we can decide when to stop training. Generally, the MSE for the training set decreases steadily as the training epochs proceed. The MSE for the testing set also decreases in the beginning stage, but if training is continued it starts to increase, giving inferior predictions for the testing set. To prevent this overtraining phenomenon, it is necessary to monitor the MSE for the training and testing sets simultaneously.
Early stopping was used in the training process. In ANN work, early stopping [15, 16] is a powerful and widely used form of cross-validation for avoiding the overtraining (or over-fitting) of neural networks. With early stopping, the point at which training is halted is determined by the minimum of the error on the validation (or testing) set rather than by the minimum of the error on the training set. In general, the data set is divided into training, validation and test sets, but for a small data set the test set can take the place of the validation set. We therefore divided the 73 compounds studied into two sets, a training set and a testing set.
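The early stopping rule described above can be sketched as a search for the epoch at which the monitored (testing-set) MSE is lowest. The `patience` parameter, the function name and the MSE history below are hypothetical; the paper does not specify how many non-improving epochs were tolerated before training was halted.

```python
def early_stopping_epoch(monitor_mse, patience=3):
    """Return (best_epoch, best_mse), with epochs numbered from 1.

    Scanning stops once `patience` consecutive epochs have passed
    without any improvement in the monitored MSE.
    """
    best_epoch, best_mse = 0, float("inf")
    for epoch, mse in enumerate(monitor_mse, start=1):
        if mse < best_mse:
            best_epoch, best_mse = epoch, mse
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_mse

# Hypothetical testing-set MSE: falling at first, then rising as
# overtraining sets in.
history = [0.90, 0.40, 0.20, 0.12, 0.10, 0.11, 0.13, 0.16, 0.20]
best = early_stopping_epoch(history, patience=3)
```

In this made-up history the minimum is reached at epoch 5, and the scan ends at epoch 8 after three epochs without improvement; the weights saved at the best epoch would be the ones kept.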
3.1 BPNN model confirmation
There were 3 nodes in the input layer and 1 node in the output layer. Our first aim was to determine the optimal number of hidden layer nodes. A series of neural networks with different numbers of hidden layer nodes, varying from 3 to 8, was trained. To assess the generalization ability of each network, we calculated the mean square error (MSE) on the testing set for each number of hidden layer nodes. MSE is computed with the following equation:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(d_i - o_i\right)^2$$

where $d_i$ is the desired output (the experimental boiling point) in the testing set, $o_i$ is the actual output in the testing set, and n is the number of compounds in the testing set. The lower the value of MSE, the better the network model. To show the trend clearly, a curve of MSE versus the number of hidden layer nodes was plotted (Fig. 3). Fig. 3 shows that the best number of hidden layer nodes is 6, so a 3-6-1 BPNN model was selected for further studies.
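The selection of the hidden layer size by testing-set MSE can be sketched as follows. The MSE function follows the definition given above (mean of squared differences between desired and actual outputs); the MSE values in the dictionary are hypothetical placeholders, not the values plotted in Fig. 3.

```python
def mean_square_error(desired, actual):
    """Mean of the squared differences between desired and actual outputs."""
    if len(desired) != len(actual) or not desired:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((d - o) ** 2 for d, o in zip(desired, actual)) / len(desired)

def best_hidden_nodes(mse_by_nodes):
    """Pick the hidden layer size with the lowest testing-set MSE."""
    return min(mse_by_nodes, key=mse_by_nodes.get)

# Hypothetical testing-set MSE for 3 to 8 hidden nodes.
mse_by_nodes = {3: 0.021, 4: 0.015, 5: 0.011, 6: 0.006, 7: 0.009, 8: 0.014}
selected = best_hidden_nodes(mse_by_nodes)
```

In practice each dictionary entry would come from training one network of that size and evaluating `mean_square_error` on the testing set.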
Fig. 3 Hidden node numbers vs mean square error (MSE) on testing set of the saturated aliphatic aldehydes and alkanones
The curves of the MSE for the training and testing sets versus the learning epochs, obtained with early stopping, are shown in Fig. 4. The MSE decreases swiftly for both the training and testing sets during the first 6 epochs and then levels off as the epochs continue to increase. At epoch 14 the MSE reaches its lowest value.
3.2 Learning algorithm comparison
Various learning rules have been derived from basic gradient descent learning. Several different modifications of the BP learning rule were tried in the training course. The learning epochs and the correlation coefficients between the predicted boiling points and the original experimental results are listed in Table 3. As shown in Table 3, the L-M algorithm is the best learning rule for predicting the boiling points of saturated aliphatic aldehydes and alkanones.
Fig. 4 The MSE of the training and testing set vs the learning epochs.
Table 3 Learning algorithm comparison
|Learning algorithm*||Training epochs||Min MSE/K2||Correlation Coefficient (r)|
*GDBP: gradient descent back-propagation; GDABP: gradient descent with adaptive learning rate (lr) back-propagation; GDMBP: gradient descent with momentum back-propagation; GDXBP: gradient descent with momentum & adaptive lr back-propagation; L-MBP: L-M back-propagation.
3.3 Predictions of boiling point using BPNN
As discussed above, we selected a 3-6-1 BPNN and confirmed that the best learning algorithm was L-M back-propagation. Fig. 5 is the plot of experimental boiling points against predicted boiling points. It shows that almost every point falls on the straight line y = x, indicating that the predicted results are close to the experimental results. Linear regression showed that the predicted boiling points were in very good agreement with the experimental data. The linear regression equation is given as:
where Tb,pre refers to the predicted boiling point, and Tb,exp refers to the experimental boiling point. The correlation coefficient (r) was 0.9992, indicating that the prediction was highly successful. The twenty-three boiling points in the testing set, paired with the predicted results, are given in Table 4. The relative predicting error for boiling points is 1.8 %. The RMS error was computed with the following equation:

$$\mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(T_{bi,\mathrm{pre}} - T_{bi,\mathrm{exp}}\right)^2}$$

where $T_{bi,\mathrm{pre}}$ refers to the predicted result, $T_{bi,\mathrm{exp}}$ refers to the experimental boiling point, and n is the number of compounds in the testing set. The RMS error of this prediction is 2.85 K, which is significantly lower than the RMS error reported in the reference.
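The RMS error defined above is the square root of the mean squared difference between predicted and experimental boiling points, and can be computed as in the sketch below. The experimental values are the four ketone entries shown in Table 1; the predicted values are hypothetical numbers chosen for illustration and are not results from the paper.

```python
import math

def rms_error(predicted, experimental):
    """Root mean square difference between predicted and experimental values."""
    if len(predicted) != len(experimental) or not experimental:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))

# Experimental boiling points (K) of entries 7, 12, 14 and 26 in Table 1.
t_exp = [352.8, 375.5, 367.4, 389.6]
# Hypothetical predictions, each off by 2 K.
t_pre = [354.8, 373.5, 369.4, 387.6]
rms = rms_error(t_pre, t_exp)
```

Because every hypothetical prediction deviates by exactly 2 K, the RMS error of this toy example is 2 K.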
After predicting the testing set data, we simulated the training set data using the optimized BPNN. Fig. 5 also shows that the predicted results for the training set agree closely with the experimental values. The linear regression equation obtained was:
Its correlation coefficient is 0.9993, indicating that the predicted results were also extremely close to the experimental boiling points. The data are also listed in Table 4.
The residual errors between all the predicted and experimental boiling points are shown in Fig. 6. As shown in Fig. 6, the residual errors for the boiling points predicted by the BPNN method were in the range from -6 K to +9 K. The boiling point of formaldehyde was excluded in previous reports because of its low molecular weight, whereas in this report its absolute error is only 8.6 K and its relative error is 3.2 %. Although the absolute error for formaldehyde is the largest among our predicted results, it still falls within the acceptable range.
Fig. 5 BPNN predicted vs experimental boiling points of all the data.
★ is predicted results of training set, △ is predicted results of testing set.
3.4 Method comparison
For comparison with BPNN, MLR analysis was carried out using the number of carbon atom (N), the carbonyl position index (P) and the number of methyl (Nm) as variables. The obtained MLR equation was:
y=298.4188+19.0872×N – 16.7502×P – 4.7512×Nm
R² was 0.9774, and the RMS error was 11.6 K. The residual errors given by MLR are plotted in Fig. 6 (b). As shown in Fig. 6, the residual errors by MLR range from -20 K to +50 K, whereas the residual errors by BPNN lie between -6 K and +6 K for all data points except two, formaldehyde and 4,4-dimethyl-2-pentanone. The MLR results were thus markedly worse than those achieved with the BPNN.
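As a worked example, the MLR equation above can be evaluated directly for a compound from Table 1. The function name is hypothetical, and the residual is taken here as predicted minus experimental, which is an assumption about the sign convention used in Fig. 6.

```python
def mlr_boiling_point(n_carbons, position_index, n_methyl):
    """Boiling point (K) from the MLR equation reported in the text."""
    return (298.4188
            + 19.0872 * n_carbons
            - 16.7502 * position_index
            - 4.7512 * n_methyl)

# Methyl ethyl ketone, entry 7 in Table 1: N = 4, P = 0.5, Nm = 2.
predicted = mlr_boiling_point(4, 0.5, 2)
# Experimental boiling point from Table 1 is 352.8 K.
residual = predicted - 352.8
```

For methyl ethyl ketone the equation gives about 356.9 K, roughly 4.1 K above the experimental value of 352.8 K.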
Fig. 6 Residual errors between all predicted and experimental boiling points. (a): BPNN method; (b): MLR method. u: upper limit of the residual errors of all data; l: lower limit of the residual errors of all data; z: line of zero residual error.
Table 4 Experimental and predicted boiling points & errors of the methods BPNN and MLR
|Tb,pre/K||Error/K||Relative Error/%||Tb,pre/K||Error/K||Relative Error/%|
*belonging to testing set.
Our work was also compared with the topological index methods previously reported in terms of the number of descriptors, the correlation coefficient (r) and the size of the data set. The results of the comparison are given in Table 5. The BPNN method proposed in this work needs only three simple descriptors, each with a clear chemical meaning, whereas the topological methods [4-6, 8] previously reported need many descriptors obtained by complicated calculation procedures.
Table 5 Comparison of different boiling points predicting methods
|Method||Number and type of descriptor||Correlation coefficient||Reference|
|BPNN||3: carbon atom numbers; carbonyl position index; methyl numbers||r=0.9992||Our work|
|Topological index||12: extended molecular distance-edge (MDE, μ) vector, m1; m2; m3; m4; m5; m6; m7; m8; m9; m10; m12; m14||R=0.9989|||
|Topological index||3: topological index of atomic ordinal number mM, carbon atom numbers; 0M; 1M||R=0.9991|||
|Topological index||3: effective length of carbon chain; carbon atom numbers; inductive effect index difference between the corresponding branched and normal alkyl isomer containing the same carbon atom number||R=0.9987|||
|Topological index||3: connectivity index 1Q; converse index 1Q’; the largest point valence of carbon atom dmax||R=0.9990|||
3.5 Network structure validation
The 3-6-1 BPNN structure was built to predict the boiling points of 73 saturated aliphatic aldehydes and alkanones. The stability of the network structure was validated through several different random groupings of the data. In addition to the training and testing sets used above, 4 other random divisions were generated for predicting boiling points. For each of these parallel experiments, linear regression was carried out between the original boiling points and the predicted results. The compounds in the training and testing sets of the 5 parallel experiments, together with the corresponding correlation coefficients, are given in Table 6. The average correlation coefficient over the 5 experiments was 0.9992, indicating that the 3-6-1 BPNN is stable and suitable for predicting the boiling points of the 73 saturated aliphatic aldehydes and alkanones studied.
Table 6 Model validation
|Divided situation||Compound numbers||r|
|1||Shown in Table 2||0.9992|
|2||Training set: 26, 69, 40, 61, 59, 52, 18, 65, 63, 46, 11, 37, 60, 50, 30, 35, 44, 2, 1, 43, 72, 21, 55, 39, 20, 58, 56, 67, 12, 19, 17, 36, 8, 51, 16, 6, 13, 32, 64, 25, 54, 62, 15, 27, 70, 48, 31, 3, 34, 14||0.9996|
|Testing set: 41, 23, 66, 4, 9, 28, 68, 10, 49, 53, 22, 57, 73, 42, 7, 45, 33, 47, 71, 24, 38, 29, 5|
|3||Training set: 19, 51, 47, 45, 26, 72, 18, 12, 21, 73, 38, 69, 58, 33, 15, 65, 14, 24, 22, 49, 60, 31, 2, 39, 63, 48, 29, 32, 56, 66, 5, 42, 25, 57, 30, 10, 36, 4, 43, 1, 34, 28, 46, 7, 71, 37, 59, 23, 41, 68||0.9985|
|Testing set: 6, 3, 62, 17, 64, 35, 52, 40, 55, 44, 8, 70, 50, 16, 11, 20, 67, 9, 61, 53, 13, 27, 54|
|4||Training set: 38, 7, 45, 46, 36, 42, 35, 3, 49, 1, 54, 41, 64, 28, 58, 47, 29, 25, 14, 71, 69, 21, 68, 5, 37, 16, 57, 50, 32, 39, 19, 4, 12, 22, 56, 65, 34, 55, 62, 13, 63, 33, 48, 2, 30, 44, 43, 27, 31, 51||0.9991|
|Testing set: 15, 59, 40, 24, 53, 60, 66, 52, 18, 61, 20, 8, 11, 72, 26, 70, 6, 67, 73, 23, 17, 9, 10|
|5||Training set: 5, 73, 12, 23, 2, 72, 6, 36, 66, 31, 32, 8, 11, 20, 7, 69, 68, 10, 49, 34, 30, 53, 50, 48, 3, 56, 38, 25, 17, 29, 59, 14, 16, 42, 27, 64, 19, 51, 35, 58, 55, 41, 9, 61, 45, 47, 21, 24, 33, 60||0.9994|
|Testing set: 37, 57, 28, 52, 71, 13, 62, 4, 46, 44, 26, 22, 54, 18, 40, 39, 70, 65, 1, 43, 15, 63, 67|
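The average correlation coefficient quoted in the text can be checked from the five r values in Table 6 with a short Python calculation; the variable names are chosen only for this example.

```python
# Correlation coefficients of the five random divisions in Table 6.
r_values = [0.9992, 0.9996, 0.9985, 0.9991, 0.9994]

mean_r = sum(r_values) / len(r_values)
spread = max(r_values) - min(r_values)
```

The mean is 0.99916, which rounds to the reported 0.9992, and the five values differ by at most about 0.0011.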
The results obtained in this paper demonstrate that it is possible to generate robust networks capable of estimating the boiling points of saturated aliphatic aldehydes and alkanones using the functional group position index (P), the carbon atom number (N) and the methyl number (Nm) as inputs. The advantage of this work over other methods is that no experimental parameters are required and the three selected parameters are easily obtained from the molecular structures of saturated aliphatic aldehydes and alkanones. The BPNN proposed in this work has been shown to predict boiling points more accurately than the linear regression approach.