Issam A.R. Moghrabi
ANALYSIS OF ALGORITHMS FOR CLUSTERING
RECORDS IN SPATIAL AND ASPATIAL DATABASES
Abstract
Response time of an information system can be improved by reducing the number of buckets accessed when retrieving spatial or aspatial records. One approach is to restructure the database so that similar records are placed close together in the file space, which increases the probability that records retrieved together are collocated within the same bucket. This paper examines two algorithms proposed to solve the clustering problem and analyzes the retrieval times they yield using probability theory. The space density of a file is characterized by measuring the 'closeness', or 'similarity', between adjacent records in the file space. An attempt is then made to estimate both the density and the number of buckets retrieved in response to a random query on a random file, using statistical inference. In addition, estimation methods are incorporated to compute the expected space density of a clustered file organized for retrieval efficiency. One of the sequencing algorithms, based on threshold values that increase average record similarity, is used to obtain varying space densities. Results suggest that, within an acceptable confidence interval, the prediction of file properties before and after sequencing is fairly accurate when the characteristic parameters of the file are known, and that the cost of the average query, in terms of the number of comparisons, can be reduced dramatically.
Key terms: Information Retrieval, Database Systems, Clustering
1. Introduction
Information Retrieval (IR) is a discipline that involves organizing, structuring and searching for information in a database system. It is concerned with the use of concepts and procedures for data storage, organization and access. Depending on the nature of the stored information, we may distinguish between bibliographic systems and (formatted) record systems. In either case it is common to represent both records and queries to the system as elements of a vector space [12,13].
In this paper we introduce a vector space model of a file for the formatted record case. The space density of such a file is characterized by measuring the closeness, or 'similarity', between adjacent records in the file space [14]. An attempt is then made to estimate both the density and the number of buckets retrieved in response to a random query made on a random file, using statistical inference. In addition, estimation methods are incorporated to compute the expected space density of a clustered file organized for efficiency. Finally, a threshold algorithm is presented which introduces a relation between space density, retrieval performance and restructuring effort.
A multi-attribute file comprises a collection of objects or records. If each record in a set of S records is represented by a set of n attributes, then each record can be considered an ordered n-tuple of values and referred to as a pattern. Record characteristics may be represented as a vector of variables in a multidimensional vector space. If each dimension is associated with an index term, then the n-dimensional pattern can be viewed as a vector describing the assignment of index terms to a particular record. The file itself can be visualized as an [S × n] pattern matrix: the rows of such a matrix define the vectors, or records, and the columns define the features, or attributes. The projection of a pattern onto a geometrical space results in an n-dimensional (Euclidean) space [13]. Such an element:
R_i = (a_i1, a_i2, ..., a_in)
is referred to as a point in geometry. The interpretation of file space properties is aided by the representation of a record as a geometrical point and of a file as a subset of the Euclidean vector space. In the resulting vector space, the complete collection of all t vectors representing the available key terms forms the basis of the vector space, and each vector can be written as a linear combination of the basis vectors in a unique fashion.
For simplicity, the work presented here assumes a fixed number of keywords and proceeds for the case in which the key terms are confined to particular fields within the record. The concepts presented in the subsequent sections apply equally to variable-length records and, with equal validity, to spatial and aspatial records in Geographic Information Systems. There also exists a finite set of descriptors, termed domains, from which key terms are chosen. All records within a file are assumed to be unique.
The hamming distance [11,12], explained in the next section, gives an indication of how the proximity or disparity between records in a file can be measured. Only records that exactly match the user requirements constitute the answer set when querying a formatted record base. The subject of partial and probabilistic match retrieval has also been the focus of recent research [10,11]. A major concern in enhancing retrieval time is the design of data structures, and of algorithms that operate on these structures, to implement the operations required by database users. The most common data structures used are B-trees [2], inverted files [1], binary coded schemes, grid files [5], R-trees [3] (for multi-dimensional spatial data) and hash techniques [4]. On the other hand, the cost of query processing can be reduced by directing attention to how records in the database are clustered. The concern in this case is how similar records can be grouped together in adjacent buckets, since they are likely to be retrieved together [6,7]. Normally, queries to a database file are handled inefficiently, either by scanning all records in the database in sequence or by obtaining a set of addresses of relevant records from the directory of an inverted file system. The major aim of sequencing is to minimize the number of buckets retrieved so that retrieval time is optimized in some sense.
The clustering problem has been addressed extensively in the past, with numerous proposals for sequencing methodologies such as those derived in [12,14] for the partial match retrieval case. Other relevant algorithms are step-wise clustering algorithms [9], single-pass single-link algorithms [11,12], Latent Semantic Indexing (LSI) [9] and automatic record pseudo-classification and retrieval [14].
We will focus our discussion in this paper on a particular sequencing algorithm (Alg. I below) that is studied both practically and analytically. The algorithm sequences the records so as to minimize, in an optimal fashion, the sum of the hamming distances between the records in the file, ensuring that similar records lie in close proximity to each other in the file space. However, this file-structuring algorithm requires comparisons of the order n², as will be shown.
For the other sequencing algorithm (Alg. II), a modified version of Alg. I, a different insertion strategy is adopted so as to reduce the number of comparisons made during sequence construction while maintaining the effectiveness of the algorithm. This enhanced version of Alg. I is based on the specification of a threshold value: insertion of an incoming record occurs immediately once the threshold condition is satisfied. Statistical methods are used to estimate the hamming distance of an unsequenced file and that of the same file after clustering with Alg. II. The percentage reduction in buckets retrieved after sequencing is also estimated.
2. Hamming Distance
The files used in this work consist of records of fixed length. A key-conversion system is assumed to map key terms onto integers; the domain is assumed to be the same for all attributes within a record, to simplify the analysis that follows.
If a file is very large it will not fit within a single bucket of the storage medium but will occupy several buckets. When a record is to be read, the bucket in which it resides is retrieved. If all records satisfying a given query are confined to one bucket, then only a single bucket needs to be retrieved. If, on the other hand, the records to be retrieved, say n of them, are all in different buckets, then n buckets will be retrieved.
The hamming distance between two records is the number of key terms not in common between them [6,7,11]. Thus, the hamming distance between two patterns i and k may be defined as

d(i,k) = Σ_{j=1..n} δ(a_ij, a_kj),

where

δ(a_ij, a_kj) = 1 if a_ij ≠ a_kj, and
δ(a_ij, a_kj) = 0 if a_ij = a_kj.
The total hamming distance of a file is the sum of all hamming distances between adjacent records. It is thus desirable to have files with as small an average hamming distance, i.e. as dense, as possible so that fewer buckets are retrieved when queries are made on the database. Achieving a small average hamming distance requires rearranging the records in the database so as to place similar records as close to each other as possible. This rearrangement is usually termed sequencing or clustering. The average hamming distance is thus a measure of the space density of a file: a small hamming distance implies a 'concentrated' space density.
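To make the measure concrete, the following minimal Python sketch (an illustration, not the paper's implementation) computes these quantities, assuming records are encoded as fixed-length tuples of integer key-term values as produced by the key-conversion system described above:

```python
def hamming_distance(rec_a, rec_b):
    """Number of key-term positions on which two fixed-length records differ."""
    assert len(rec_a) == len(rec_b), "records must have the same number of key terms"
    return sum(1 for a, b in zip(rec_a, rec_b) if a != b)

def total_hamming_distance(records):
    """Total Hamming distance of a file: sum over adjacent record pairs."""
    return sum(hamming_distance(records[i], records[i + 1])
               for i in range(len(records) - 1))

def average_hamming_distance(records):
    """Average Hamming distance, the paper's measure of space density."""
    return total_hamming_distance(records) / (len(records) - 1)
```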
3. Sequencing Algorithm I (Johnson's Algorithm [7,8])
This algorithm clusters an unsequenced file into optimal form. The sequencing process retrieves the records in the order they were originally received and finds new positions for them among the records sequenced so far. The position sought for each incoming record is the one leading to the minimum increase in hamming distance, so that the resultant file has the minimal possible hamming distance. All possible positions at which an incoming record can be placed are tested; for each position, the increase in hamming distance is computed, and track is kept of the position giving rise to the smallest increase, where the incoming record is ultimately placed. The sequencing stops when the sequenced list consists of all the file records.
In principle, the algorithm's only input is what is known as a proximity matrix, denoted by M = [d(i,j)], where an entry d(i,j) denotes the hamming distance between clusters i and j. At the start, each record is treated as a cluster on its own. On the first pass, the n × n proximity matrix is scanned for the most similar pair of clusters; i.e.,

d(r,s) = min {d(i,j)} over all pairs of distinct clusters i and j.
When r and s have been determined, the two clusters are merged into a single one. The rows and columns corresponding to clusters r and s in the proximity matrix M are made inactive and a new row and column corresponding to the newly formed cluster are inserted. The new entry is defined as
d(k,(r,s)) = min {d(k,r),d(k,s)}.
The second pass then works on an (n−1) × (n−1) matrix and proceeds as in the first pass. The process is repeated until all the records (original clusters) are in one cluster.
The first record retrieved from the unsequenced list of records has only one possible insertion position. The second has two, since it can be placed either in front of or behind the first record; both positions give rise to the same increase in hamming distance, so it is placed at either. The third record has three possible positions. If there are n records in the file, the nth record has n possible positions to try and hence n comparisons to make. Therefore the total number of comparisons made in sequencing a file of size n is 1 + 2 + 3 + ... + n = n(n+1)/2 = n²/2 + n/2, which is O(n²).
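The insertion strategy just analysed can be sketched as follows. This is a schematic Python rendering under the same tuple encoding as the Section 2 sketch (hamming_distance), not the author's implementation; insertion_increase and sequence_alg1 are illustrative names:

```python
def insertion_increase(seq, rec, pos, d):
    """Increase in total Hamming distance if rec is inserted at index pos of seq."""
    left = seq[pos - 1] if pos > 0 else None
    right = seq[pos] if pos < len(seq) else None
    inc = 0
    if left is not None:
        inc += d(left, rec)
    if right is not None:
        inc += d(rec, right)
    if left is not None and right is not None:
        inc -= d(left, right)   # the broken adjacency no longer contributes
    return inc

def sequence_alg1(records, d):
    """Algorithm I sketch: place each incoming record at the position of
    minimal increase in total Hamming distance (O(n^2) comparisons overall)."""
    seq = []
    for rec in records:
        best = min(range(len(seq) + 1),
                   key=lambda p: insertion_increase(seq, rec, p, d))
        seq.insert(best, rec)
    return seq
```

A call such as sequence_alg1(records, hamming_distance) then yields the clustered ordering.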
Seeking the optimal arrangement of records in the file buckets makes the algorithm expensive to implement. Attempting to reduce the number of comparisons made in the clustering process results in the algorithm described in the next section.
4. Sequencing Algorithm II
This algorithm operates like sequencing Algorithm I, with the difference that it is based on the specification of a threshold value [8]. When the threshold condition is satisfied, no further comparisons are made and the record is inserted at the position that satisfied the condition, thus reducing the number of comparisons.
Given a file of size N and a dissimilarity index definition d, a complete weighted graph denoted by (N,d) may be constructed. Each of the N nodes in the graph represents a record in the file space of size N. The weight d(x,y) on each one of the N(N-1)/2 edges of the graph represents the dissimilarity between the 2 records x and y connected by the edge (x,y). The algorithm's goal is to construct a short spanning path through the nodes thereby reducing to near optimal the overall hamming distance.
Thus the operational steps of the algorithm can be specified as follows:
A subpath of the graph (N,d) is an ordered subset of the nodes in N. (P,m) denotes a path including a path P and an extra node m which is not in P. The path (P,m) is formed as follows:
Case 1: If P passes through more than one node then
(a) find an edge (x,y) between nodes of P such that the value
[d(x,m) + d(m,y)] − d(x,y) ≤ h (some user-defined threshold value),
(or is minimal if no such edge exists)
(b) delete edge (x,y) and add edges (x,m) and (m,y) to form path (P,m)
Case 2: If P passes through exactly one node n, then form the path (P,m) as the edge (n,m) or (m,n).
An incoming record starts at the front of the already sequenced list of records and tries one position after the other until a position is found where the increase in hamming distance is less than or equal to the specified threshold value. If no such position is found, the insertion would take place at the position yielding the smallest increase in hamming distance (even though it is greater than the threshold value specified) which had been kept track of during the search for the first position.
The increase in hamming distance ranges between 0 and r inclusive, r being the number of key terms per record. It is 0 when the incoming record is placed adjacent to a record identical to it, and it is r when the record is placed between two records with which it shares no key term and which share no key term with each other.
The higher the threshold value, the fewer positions an incoming record needs to try before it finds a satisfactory position; in this case, the threshold condition is lenient. A threshold value of r results in a total number of comparisons equal to the number of records in the file minus 2, since the first record to be sequenced by comparison is the third one in the unsequenced list and each incoming record makes exactly one comparison. With such a threshold value no sequencing is done and the resultant file turns out to be a copy of the original file.
A low threshold value implies a higher number of positions tried and therefore a greater number of comparisons. For example, with a threshold value of 0, and if the records in the file are unique, no position tested will be satisfactory; incoming records all end up placed at the position giving rise to the smallest increase in hamming distance. Hence sequencing Algorithm II with a threshold of 0 degenerates into sequencing Algorithm I. In general, the number of comparisons made by Algorithm II is always less than or equal to that of Algorithm I, with equality in the case of a threshold value of 0.
A wise choice of the threshold value ensures that Algorithm II can be far more efficient than Algorithm I, i.e. producing an almost fully sequenced file with far fewer comparisons.
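A corresponding sketch of the threshold variant, reusing insertion_increase from the Algorithm I sketch, might look as follows (again illustrative, not the author's code; h denotes the user-defined threshold):

```python
def sequence_alg2(records, d, h):
    """Algorithm II sketch: scan positions from the front and insert at the
    first one whose increase in Hamming distance is <= h; if none qualifies,
    fall back to the minimal-increase position tracked during the scan."""
    seq = []
    for rec in records:
        best_pos, best_inc = 0, None
        for pos in range(len(seq) + 1):
            inc = insertion_increase(seq, rec, pos, d)
            if best_inc is None or inc < best_inc:
                best_pos, best_inc = pos, inc
            if inc <= h:               # threshold met: stop searching
                best_pos = pos
                break
        seq.insert(best_pos, rec)
    return seq
```

With h = 0 and unique records the early exit never fires and the behaviour coincides with Algorithm I; with h = r the first position always qualifies and the file keeps its original order, matching the observations above.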
5. Statistical Analysis of Algorithm II
5.1 Estimating Expected Hamming Distance for a Non-clustered File
For the following statistical evaluation to be valid, the assumption is made that records are initially randomly distributed within the file [15]; that is to say, there is no preconceived arrangement of the records.
Let the key terms in a record be drawn from a range of n possible values. If there are r key terms in a record, then the total number of possible records with different combinations of key terms is C(n, r), where C(·,·) denotes the binomial coefficient. In selecting the r key terms for a record, all n possible choices have an equal chance of being selected, so the probability of any of these record combinations being part of a file is 1/C(n, r). That is, all record combinations have an equal chance of being selected into the file at every selection, assuming records are selected with replacement.
Let the first record selected be A, with r key terms, and the second record be B also having r key terms. If x is the number of similar key terms between A and B then the hamming distance between A and B is (r - x). Therefore the probability of obtaining a hamming distance of (r-x) between A and B is given by:
P(x) = C(r, x) C(n − r, r − x) / C(n, r).    (1)
In order to obtain the expected hamming distance we sum, over all possible values of x, the product of the hamming distance (r − x) and its probability of occurring, P(x). The possible values of x lie between 0 and r inclusive, the two extremes being x = 0, when A and B are completely different and have no key term in common, and x = r, when A and B are identical. Therefore the expected hamming distance can be obtained as

E(hd) = Σ_{x=0..r} (r − x) P(x),    (2)
with P(x) as in (1).
If there are N records in the file, then each incoming record has the same expected hamming distance between itself and the records it is placed next to. By summing the (N − 1) expected hamming distances, we obtain the expected total hamming distance:

E(total hd) = (N − 1) E(hd).    (3)
The average hamming distance is computed using

average hd = total hd / (N − 1).    (4)
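Equations (1)-(4) can be evaluated directly. The sketch below assumes r ≤ n (distinct key terms within a record) and uses Python's math.comb for the C(·,·) coefficients:

```python
from math import comb

def p_common(x, r, n):
    """Equation (1): probability that two random records share exactly x of
    their r key terms, drawn from n possible terms.
    (math.comb returns 0 when r - x > n - r, so infeasible x contribute 0.)"""
    return comb(r, x) * comb(n - r, r - x) / comb(n, r)

def expected_hd(r, n):
    """Equation (2): expected Hamming distance between two random records."""
    return sum((r - x) * p_common(x, r, n) for x in range(r + 1))

def expected_total_hd(N, r, n):
    """Equation (3); dividing by (N - 1) gives the average of equation (4)."""
    return (N - 1) * expected_hd(r, n)
```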
5.2 Estimation of Hamming Distance after Sequencing
This section deals with estimating the hamming distance after sequencing a random file using Algorithm II. After clustering, the hamming distance between adjacent records is assumed, for the sake of the analysis to follow, to be less than or equal to k; that is, it can be 0, 1, 2, ..., or k.
Let p be the probability of the jth attribute being a match between two adjacent records. If clustering Algorithm II finds an insertion position for the record that yields an increase in hamming distance less than or equal to the prescribed threshold value k, then the probability that the jth attribute of the record under consideration matches the corresponding attribute of an adjacent record is (r − k)/r, instead of 1/n (n being the attribute value range) before clustering. This is attributable to the fact that the maximum number of terms not in common between any two consecutive records after sequencing is k, making the minimum number of terms in common r − k. Therefore, the expected hamming distance in a dense file is given as

E(hd) = Σ_{x=r−k..r} (r − x) P(x), where P(x) = C(r, x) C(n − r, r − x) / C(n, r).    (5)
If there are N records in the file, the expected total hamming distance is obtained as

E(total hd) = (N − 1) E(hd).    (6)
The expected average hamming distance after sequencing using threshold value k with sequencing algorithm 2 is computed using (4), with E(hd) as in (5).
When the threshold value is 0, the minimum number of similar key terms between adjacent records is (r − k) = r, the number of key terms per record. This implies that adjacent records are identical, so the file is made up of identical records. Substituting k = 0 in (6), the expected total hamming distance is 0, which is what is expected if all records in a file are identical.
P(x) is the probability that there are x common terms, which equals the probability that there are (r − x) terms not in common. These (r − x) terms are chosen from the (n − r) key terms that do not appear in the original record (which has r key terms). Let us examine how P(x) can be calculated. The number of possible choices is n (the random range); the number of eligible choices is n − r (key terms not in the original record); the number of those chosen is r − x (the key terms not in common); and x is the number of key terms in common. Thus

P(x) = C(r, x) C(n − r, r − x) / C(n, r).
If k = 2, the minimum value for x equals r − 2, and the possible hamming distances are 0, 1 and 2.
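Equations (5) and (6) admit the same direct evaluation; the sketch below reuses p_common from the Section 5.1 sketch and assumes 0 ≤ k ≤ r:

```python
def expected_hd_clustered(r, n, k):
    """Equation (5) sketch: expected adjacent Hamming distance when sequencing
    guarantees at most k terms differ, so x ranges over r-k .. r."""
    return sum((r - x) * p_common(x, r, n) for x in range(r - k, r + 1))

def expected_total_hd_clustered(N, r, n, k):
    """Equation (6): (N - 1) adjacent pairs contribute the expectation above."""
    return (N - 1) * expected_hd_clustered(r, n, k)
```

For k = 0 the sum collapses to the single term (r − r)P(r) = 0, confirming the observation above.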
5.3 Estimation of Percentage Reduction in Buckets Retrieved after Sequencing
Let p be the probability that a record satisfies a query. If there are N records in a file, the probability that Y of these records satisfy the query follows a binomial distribution:

P(Y) = C(N, Y) p^Y (1 − p)^(N−Y).    (7)
If Y is the number of records retrieved then, for a completely random file, these records will be randomly distributed in the file. We would therefore expect the records to lie in 1 bucket, or 2 buckets, or 3 buckets, ..., or Y buckets. If we assume each of these possibilities is equally likely, then each has a probability of 1/Y of occurring. The estimate of the buckets retrieved is then the sum of the products of the possible numbers of buckets retrieved and their probabilities:

E(buckets | Y) = Σ_{i=1..Y} i (1/Y) = (Y + 1)/2.
Now if the file is sequenced, it is expected that all the records to be retrieved will be placed consecutively. The number of buckets retrieved will therefore depend on the value of Y and on the bucket size. If Y is less than the bucket size, then only one bucket is expected to be retrieved. In general, the number of buckets retrieved with bucket size B is at least ⌈Y/B⌉.
Therefore the percentage reduction in the buckets retrieved after sequencing, if the number of records retrieved is Y, is obtained as

R(Y) = [(Y + 1)/2 − ⌈Y/B⌉] / [(Y + 1)/2].    (8)
In general, summing over all possible values of Y, the expected reduction in buckets retrieved can be obtained as

E(R) = Σ_{Y=1..N} P(Y) R(Y).    (9)
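Equations (7)-(9) combine into a short computation. The bucket size B is left unspecified in the analysis above, so the value in the commented usage line is purely illustrative:

```python
from math import comb, ceil

def expected_bucket_reduction(N, B, p):
    """Equation (9): expected fractional reduction in buckets retrieved,
    averaging equation (8) over the binomial distribution of Y in (7)."""
    total = 0.0
    for y in range(1, N + 1):
        p_y = comb(N, y) * p**y * (1 - p)**(N - y)   # equation (7)
        random_buckets = (y + 1) / 2                 # expected buckets, random file
        clustered_buckets = ceil(y / B)              # consecutive placement
        total += p_y * (random_buckets - clustered_buckets) / random_buckets
    return total

# Illustrative only: N = 1000 records, a hypothetical bucket size B = 10,
# and p taken as (1/n)^q from Section 5.4.
# print(expected_bucket_reduction(1000, 10, (1 / 4) ** 3))
```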
5.4 Estimation of Number of Records Satisfying a Query
Let

q be the query size,
x be the number of data attributes that match the query record at hand,
r be the record size (r ≥ q).

Also let p' be the probability of a single attribute matching a corresponding query attribute. If the attribute range is n, then p' = 1/n.
If there are q attributes in the query, the probability that x data attributes match these q attributes follows a binomial distribution. The probability that x attributes out of q match is given as

P(x) = C(q, x) p'^x (1 − p')^(q−x), where p' = 1/n.    (10)
For a data record to satisfy a query, all attributes in the query record must have matching entries in the data record; that is, x = q. Thus (10) reduces to

P(q) = p'^q = (1/n)^q,

which is the probability of a record satisfying a given query. This last quantity can thus be used in computing (7) and (9).
The estimate of the expected number of records to be retrieved is therefore

E(Y) = Σ_{Y=0..N} Y P(Y) = N p,

where p is the same as in (7).
If r = n, i.e. every record contains the full range of key terms, then the probability that a key term will be found in a record is r/n = 1; that is, there is 100% certainty that a key term in the range of possible key terms will be found in a record. For a record to satisfy a query, all the key terms in the query must be found in the record being queried. Therefore, for the x key terms in the query, from the first to the xth, each must be found in the record; if any of them is not present, the record is not retrieved.

Considering the first key term in a query, the probability that it is found in the record being queried is r/n. Since query key terms are not repeated within a query, the second key term differs from the first; we therefore no longer consider the first key term, but only the remaining (n − 1) key terms in the random range and the (r − 1) remaining terms in the record being queried. Therefore the probability that the second key term in the query is found within the record, after the first has been found, is (r − 1)/(n − 1). Similarly, the probability that the third key term is found after the first and second have been found is (r − 2)/(n − 2), and the xth key term has a probability of (r − x + 1)/(n − x + 1) of being found.
The probability that a record of r key terms satisfies a query is the probability that the first key term of the query is satisfied, and the second, and the third, ..., and the xth; it is therefore the product of the individual probabilities of the key terms being satisfied. Thus the probability of a record of size r being retrieved by a query of size x is

p = (r/n) × ((r − 1)/(n − 1)) × ... × ((r − x + 1)/(n − x + 1)),

where n is the number of key terms from which both the record and the query are selected.
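This product is straightforward to evaluate; the sketch below assumes distinct key terms in both record and query, as in the derivation:

```python
def p_satisfies_query(x, r, n):
    """Probability that a record of r distinct key terms (out of n) contains
    all x distinct query terms: (r/n)((r-1)/(n-1))...((r-x+1)/(n-x+1))."""
    p = 1.0
    for i in range(x):
        p *= (r - i) / (n - i)
    return p
```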
Let p be the probability that a record satisfies a query, so that (1 − p) is the probability that it does not. If there are N records in the file, the probability that Y of these records satisfy the query follows a binomial distribution, as in (7):

P(Y) = C(N, Y) p^Y (1 − p)^(N−Y).    (11)

Y can take values from 0 to N, i.e. from no record to all records satisfying the query. Therefore, by summing from 0 to N the products of the values of Y and their probabilities of occurring, we obtain an estimate of the expected number of records to be retrieved:

E(Y) = Σ_{Y=0..N} Y P(Y) = N p.    (12)
The expected numbers of buckets retrieved before and after sequencing, and the resulting percentage reduction, then follow exactly as in (8) and (9), with p as derived above.
6. Simulation Experiments and Conclusions
Our experimental results are summarized in the following table for the parameters N = 1000 (file size), n = 4 (attribute range), r = 10 (record size) and q = 3 (query size), with a file of 100 query records:
| Threshold | Hamming distance (actual) | Hamming distance (estimated) | Buckets retrieved (actual) | Buckets retrieved (estimated) | Comparisons | Bucket reduction |
|---|---|---|---|---|---|---|
| 0 | 3275 | 3253 | 3292 | 3277 | 491904 | 0.3554 |
| 1 | 3278 | 3255 | 3291 | 3284 | 476627 | 0.3556 |
| 2 | 3308 | 3299 | 3293 | 3289 | 422488 | 0.3552 |
| 3 | 3421 | 3412 | 3334 | 3325 | 314650 | 0.3472 |
| 4 | 3803 | 3797 | 3575 | 3568 | 160245 | 0.3 |
| 5 | 4511 | 4503 | 3984 | 3976 | 45321 | 0.2199 |
| 6 | 5295 | 5289 | 4383 | 4372 | 10478 | 0.1418 |
| 7 | 5927 | 5921 | 4696 | 4691 | 4320 | 0.0805 |
| 8 | 6374 | 6366 | 4861 | 4855 | 2838 | 0.0482 |
| 9 | 6875 | 6871 | 4928 | 4919 | 2172 | 0.035 |
| 10 | 9922 | 9922 | 5107 | 5107 | 998 | 0 |
| Unseq. | 9925 | 9925 | 5107 | 5107 | - | 0 |
| Totals | 65915 | 65813 | 49851 | 49770 | 1932041 | - |
Table 2. Comparing simulation results with the figures from the statistical formulae
The statistical estimates obtained from the formulae derived in Section 5 are comparable to the simulation results, as the figures in Table 2 suggest; the discrepancies are due to the practical implementation of the algorithm. We would like to stress the characteristics of the information retrieval system we have been simulating and the various parameters (such as bucket size and threshold value) that affect the practical performance of the algorithm. Algorithm I is a special case of Algorithm II, obtained by setting the threshold value to zero. Also, choosing a threshold value equal to the number of attributes in a record is equivalent to applying no sequencing to the file and leaving it random. These properties are verified by the results presented in Table 2.
Various threshold values give rise to varying average hamming distances: the smaller the threshold value, the smaller the average hamming distance, and hence the denser the sequenced file. Using large threshold values (greater than the average hamming distance of the random file) does not produce any significant reduction in the hamming distance. Varying threshold values also result in varying numbers of comparisons: as the threshold value increases, the number of comparisons drops sharply.
As a further means of assessing performance, the following figure compares the Hamming distance (HD), number of buckets retrieved (Buckets) and the number of comparisons (Comp), as a function of the threshold value, using the simulation in table 2:

Further research may concentrate on finding a model that predicts the 'best' combination of the parameters that govern the practical response time. To this end, linear regression techniques are under investigation.
Acknowledgments
The author completed the above work while on a Fulbright scholarship at the University of Wyoming, U.S.A., and gratefully acknowledges the support received from the Council for International Exchange of Scholars, Washington, D.C.
References
1. Cardenas, A.F., Analysis and performance of inverted database structures, Communications of the ACM, 18, 5, 253-263, 1975.
2. Comer, D., The ubiquitous B-tree, ACM Computing Surveys, 11, 2, 121-138, 1979.
3. Guttman, A., R-trees: a dynamic index structure for spatial searching, Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 47-57, 1984.
4. Larson, P.A., Linear hashing with partial expansions, Proceedings of the 6th International Conference on VLDB, 224-232, 1980.
5. Nievergelt, J. and Hinterberger, H., The grid file: an adaptable, symmetric multikey file structure, ACM Transactions on Database Systems, 9, 1, 38-71, 1984.
6. Jain, A.K. and Dubes, R.C., Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.
7. King, B., Step-wise clustering procedures, Journal of the American Statistical Association, 69, 1967.
8. Lowden, B.G.T., An approach to multikey sequencing in an equiprobable key term retrieval simulation, Proceedings of the 8th ACM SIGIR Conference on Research and Development in Information Retrieval, 1985.
9. Mauldin, M.L., Conceptual Information Retrieval: A Case Study in Adaptive Partial Parsing, Kluwer Academic Publishers, 1991.
10. Moghrabi, I.A.R., Expert systems and their use in libraries, Sci-Quest, 5, 2, 1995.
11. Salton, G., Dynamic Information and Library Processing, Prentice-Hall, Englewood Cliffs, New Jersey, 1975.
12. Salton, G. and McGill, M.J., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
13. Wong, S.K.M., Ziarko, W. and Raghavan, V.V., On modeling of information retrieval concepts in vector spaces, ACM Transactions on Database Systems, 12, 2, 1987.
14. Yu, C.T., Lam, K., Suen, C. and Siu, M.K., Adaptive record clustering, ACM Transactions on Database Systems, 10, 2, 180-204, 1985.
15. Yu, C.T., Luk, W.S. and Siu, M.K., On the estimation of the number of desired records with respect to a given query, ACM Transactions on Database Systems, 3, 1, 41-56, 1978.
16. Ymmanuel, M.B., Clustering Techniques for Record Databases, M.Sc. Dissertation, University of Essex, 1993.
Author Information
Issam A.R. Moghrabi, Natural Science Division, Lebanese American University, P.O. Box 13-5053, Beirut, Lebanon
email: imoghrbi@lau.edu.lb