7 messages in com.mysql.lists.mysqlRe: Optimising for many rows and retu...| From | Sent On | Attachments |
|---|---|---|
| Nick Hill | 23 Apr 2006 08:48 | |
| Adam Wolff | 23 Apr 2006 12:14 | |
| Adam Wolff | 23 Apr 2006 12:15 | |
| Alexey Polyakov | 23 Apr 2006 14:12 | |
| Nick Hill | 23 Apr 2006 14:40 | |
| Nick Hill | 24 Apr 2006 07:57 | |
| Adam Wolff | 24 Apr 2006 08:22 |
| Subject: | Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)![]() |
|---|---|
| From: | Adam Wolff (awo...@gmail.com) |
| Date: | 04/23/2006 12:14:27 PM |
| List: | com.mysql.lists.mysql |
I didn't look through your code very carefully. Have you confirmed (using EXPLAIN) that your select query is using the index?
I don't much about the mysql optimizer, but it's possible that this query:
$query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2 and lon>$lon1 and lon<$lon2";
Actually runs through the table four times instead of twice, and maybe can't even use the index for the whole query.
Is the performance different if you use a subqery? Some thing like: SELECT id, lat, lon FROM integer_test WHERE lat>$lat1 and lat<$lat2 LEFT JOIN (SELECT id, lat, lon FROM integer_test AS i2 WHERE lon>$lon and lon<$lon ) ON integer_test.id = i2.id
assuming that you have independent indicies on lat and lon. I didn't try this so the syntax may be wrong. You could also dispense with the id idea, but if so you should probably declare (lat,lon) as your primary key.
A
and lon>$lon1 and lon<$lon2";
On Apr 23, Nick Hill wrote:
Hello
I have been looking at planning the database strategy for openstreetmap (http://www.openstreetmap.org).
There are several data types stored in tables with longitude and latitude columns. Select statements work by selecting
where lat>$lat1 and lat<$lat2 and lon>$lon1 and lon<$lon2
I have made many empirical tests and have concluded:
1) I can improve performance by a factor of 2-2.5 by changing the double lat/lon to an integer then selecting on an integer.
2) I have concluded that for each 10 fold increase in the number of records, select queries take twice as long. For each doubling of the number of returned records, there is a sqrt(2) increase in select query time.
All this is assuming all relevant database information is in memory.
As the database grows, it would likely improve database performance by splitting an individual table into several thousand tables using the file system directory btree algorithm to effectively pre-select the data before the query is handled to the MySQL engine. This is not a neat solution. A much better way would be to improve the mysql index performance on very large numbers of records.
Given that there is such a strong relationship between the number of records returned, and query time, I conclude that the whole index tree is matched for every given number of root x records returned. If all records we are matching are under a single node or under a small number of nodes in the index tree, perhaps there is some way of telling the database engine to ignore the rest of the index tree.
Could this work, or am I misunderstanding how the index tree works? Are there existing optimisations which can de-couple the relationship between number of records and query time where the records I am selecting are within a small range?
Background information:
We can boil all this down to a mathematical relationship where query1 selects s number of records from r records dataset and query2 selects b number of records from c records dataset
Tquery1 is time to execue query 1 and Tquery2 is time to execute query2.
Tquery2=Tquery1 * sqrt(b/s) * (2^log(r/c)) + (b-s*CONST/15000)+CONST Where for my processor, CONST is 0.03
This can be simplified (loosing some accuracy) to:
Tquery2=Tquery1 * sqrt(b/s) * (2^log(r/c)
Raw data for selects: Creating a plan with 100000 points and averaging over 25 queries Points_per_tile Query_Time 25600 0.118 25600 0.119 25600 0.119 25600 0.119 12800 0.069 6400 0.042 3200 0.026 1600 0.017 800 0.011 400 0.008 200 0.005 100 0.004 50 0.003 Creating a plan with 1000000 points and averaging over 25 queries Points_per_tile Query_Time 25600 0.224 25600 0.223 25600 0.222 25600 0.223 12800 0.145 6400 0.093 3200 0.062 1600 0.043 800 0.029 400 0.020 200 0.015 100 0.011 50 0.008 Creating a plan with 10000000 points and averaging over 25 queries Points_per_tile Query_Time 25600 0.558 25600 0.548 25600 0.551 25600 0.551 12800 0.376 6400 0.257 3200 0.181 1600 0.125 800 0.087 400 0.062 200 0.044 100 0.031 Creating a plan with 100000000 points and averaging over 25 queries Points_per_tile Query_Time 25600 2.422 25600 2.332 25600 2.493 25600 2.446 12800 1.769 6400 1.295 3200 0.866 1600 0.657 800 0.456 400 0.328 200 0.233 100 0.159 50 0.118
Source code for the above test: #!/usr/bin/perl -w
#Program creates random point fields eqyuivalent to bitfieldtest.pl except the data is stored #as regular signed integers. To represent the globe as closely as possible, extents between #-180 and +179.999999 will be used. Therefore, adding 180 normalises for international date line 0. #Prime Meridian 180. 111111**0.000001
use DBI; use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );
$DBHOST = "localhost"; $DBNAME = "nickh"; $DBUSER = "nickh"; $DBPASS = "xxxxxx";
#initialise database $driver = "mysql"; $dsn = "DBI:$driver:database=$DBNAME;host=$DBHOST"; $dbh = DBI->connect($dsn, $DBUSER, $DBPASS);
#@plane_densities=(100000000); @plane_densities=(100000,1000000,10000000,100000000); @tile_points=(25600,25600,25600,25600,12800,6400,3200,1600,800,400,200,100,50); $query_iterations=25; $debug=0;
sub create_bitfield; sub run_tests;
foreach $density(@plane_densities){ print "Creating a plan with $density points and averaging over $query_iterations queries\nPoints_per_tile Query_Time\n"; create_bitfield($density); foreach $tilepoints(@tile_points){ my $testtime=run_tests($density,$tilepoints); printf '%-14d %.3f',$tilepoints,$testtime; # prints "<1.0>"print "$density $testtime s\n"; print "\n"; } }
$dbh->disconnect; exit 0;
sub create_bitfield{ #takes number of points on the point field as argument. #We first create table without an index, as the indes slow populating table my $prep=q~ DROP TABLE IF EXISTS integer_test;
CREATE TABLE `integer_test` ( `lat` integer NOT NULL default '0', `lon` integer NOT NULL default '0' ) TYPE=MyISAM ~;
#drop/create tables without index foreach $aprepare(split(/;/,$prep)) { #print "preparing $preparation"; $dbh->do($aprepare); }
#populate table for ($i=0;$i<$_[0];$i+=100){ #create 100 element batched inserts (value,value),(value,value) etc my $insert=''; for ($j=0;$j<100;$j++){ if ($j>0) {$insert .= ','; } $lat=int(rand 3599999999)-1800000000; $lon=int(rand 3599999999)-1800000000; $insert .= "($lat,$lon)"; }
my $sql="INSERT INTO `integer_test` VALUES $insert;"; #print "SQL1 is $sql1\n\n"; $dbh->do($sql);
} #After populating table, we create indexes. #print "Creating index... This may take some time\n"; my $sql='CREATE INDEX index1 ON integer_test (lat,lon);'; $dbh->do($sql); }
sub run_tests{ #Parameters: Points in field; Size of tile in average number of points my $number_of_points=$_[0]; my $target_query_return=$_[1]; my $proportion_of_each_axis=1/(sqrt($number_of_points/$target_query_return)); my $max_extent=1-$proportion_of_each_axis; #Maximum extent for query without extending beyond bound my $returnedrows; $query1total=0;
for($i=0;$i<$query_iterations;$i++){ #$lat1=(0.4+(rand ($max_extent/10))); #$lon1=(0.4+(rand ($max_extent/10)));
$lat1=int(rand ($max_extent*3599999999))-1800000000; $lon1=int(rand ($max_extent*3599999999))-1800000000;
$lat2=int($lat1+($proportion_of_each_axis*3599999999)); $lon2=int($lon1+($proportion_of_each_axis*3599999999));
if ($debug){print "querying bounds $lat1 $lon1 $lat2 $lon2 with queries \n";}
$query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2 and lon>$lon1 and lon<$lon2";
#print "Query 1: $query1\n"; $t0 = Time::HiRes::time(); my $sth = $dbh->prepare($query1); $sth->execute(); $returnedrows+=$sth->rows(); #$sth->finish; $elapsed=Time::HiRes::time()-$t0; #print "Fetched $rows points in $elapsed \n"; $query1total+=$elapsed;
} #print $returnedrows; if ($debug){print "Each of the $query_iterations returned an average " . $returnedrows/$query_iterations . "records\n";} #print "unit test time is " . ($query1total/$query_iterations) . "seconds \n\n"; return ($query1total/$query_iterations); }




