How BLOB performance can be improved on SQL Anywhere, and what gains to expect.
Document Management Systems (DMS; at their core a combination of a history-keeping file archive and a full-text search index) are on the rise these days, and a typical implementation stores files in a relational database (RDBMS). In the last weeks I had the chance to talk to the head of development of a DMS vendor. He told me about some performance problems he had noticed when using SQL Anywhere as a back-end store. As I have known this particular product well for more than a decade, I set up a lab server and a micro benchmark according to his implementation details and ran some tests. The result is quite interesting for all SQL Anywhere users dealing with BLOBs and JDBC.
“Everybody who measures measures crap.” (German saying)
Micro benchmarks typically do not tell much about reality. The reason is that real-world circumstances are much more complex than a purely synthetic benchmark can reflect. In this particular case, things are different: the original performance problem was recorded by observing the actual behaviour of real DMS machines. The micro benchmark was then built as an extraction (a simplified copy) of the real JDBC code that the DMS software executes in production, in order to find a solution to the problem, which was described as precisely as: “SELECTing and INSERTing BLOBs using SQL Anywhere is rather slow compared to other DBMS, under the same conditions!”
While I actually don’t care what other DBMS do, I do care about performance problems of SQL Anywhere, since I have to support hundreds of SQL Anywhere servers running at our customers. Few of them ever had significant performance problems, and none of them had a problem that couldn’t be fixed easily. So I was eager to find out the actual cause and how to fix it.
Only BLOBs appeared slow
The software developers told me that they had already narrowed the problem down to the sole fact that operations are only slow when dealing with BLOBs, while all other operations performed comparably to other DBMS. They also told me that they were using jConnect (since only this driver supports Statement.addBatch(), which their software depends on) and SNAPSHOT isolation (since one of their background processes runs for an hour, locking virtually all rows, which is unacceptable).
So I set up the micro benchmark and ran several tests. To avoid dealing with already-solved problems and old bugs, I first installed the latest patches of everything: Windows 7, drivers, SQL Anywhere 11.0.1 and jConnect 7. I also defragmented the single drive used (more drives are out of scope, to stay comparable to the other DBMS, which were also measured on a single drive) after cleaning old files from it. In addition, Java was updated to 1.6.0_18. The benchmark itself is rather a hack, as I modified its code for each combination of features and never planned to keep it anyway. As it didn’t contain much more than a few lines, I see no benefit in posting it.
Since the absolute numbers were of no interest, but only the relative improvement between combinations of options, I did not use a real server for the micro benchmark but executed it on my laptop. A real server would obviously be much faster, but that does not change the actual result of the test, which only compares my machine with itself under different usage patterns; that comparison holds when scaled up. For those interested in the actual test environment, here it is:
- Hardware: laptop with dual-core CPU (AMD® Turion® X2-64), 2 GB RAM, single 3.5″ SATA HDD (average transfer rate: 48 MB/s, measured using HD Tune 2.55), Windows® 7 Professional 32-bit
- Software: Sybase SQL Anywhere 11.0.1 with the latest “SQL Anywhere” JDBC driver contained in that box (which is neither “jConnect” nor “iAnywhere”, see http://iablog.sybase.com/hinsperg/2009/10/connecting-to-sql-anywhere-using-jdbc/), jConnect 7, Sun Java SE 1.6.0_18
As time is a scarce resource, I had to concentrate on the most interesting things to check. After some study of the SQL Anywhere manual, I decided to evaluate the following:
- Preallocation of disk space – Assumption: BLOBs are huge so preallocating lots of space should have strong effect especially on DMS-type applications.
- Preallocation of RAM – Assumption: BLOBs are huge so dynamic fighting for lots of RAM could become a bottleneck.
- Modified Page Size – Assumption: The typical BLOB is around 512K according to the vendor, so a modified page size might have a positive effect.
- SNAPSHOTs – Assumption: Keeping a history in temp space could become a bottleneck since BLOBs are huge.
- APIs – Assumption: Some APIs might be more efficient than others.
- Transport – Assumption: SharedMemory should be faster than TCP/IP.
- Driver – Assumption: My experience with our customers’ OLTP and OLAP use of SQL Anywhere showed a significant performance gain by moving away from jConnect, so this should also be measurable with BLOBs.
So after spending a sunny day at the end of March, I had measured the following numbers. Remember, this is just a micro benchmark reflecting one single use case: the performance hot spot reported by a single vendor for a single application. The only valid interpretation is to see what is beneficial to do and what is not. It is invalid to conclude that SQL Anywhere is particularly slow or fast compared to other products. But it is valid to assume that the relative relations hold for SQL Anywhere in other environments as well. And this is what we wanted to learn. So here it is:
- Write Test
Filling a table with 1024 rows, each holding one BLOB of 512 KB (i.e. writing more than 512 MB to disk), in a single transaction (we want to measure throughput, not the speed of COMMIT; multiple transactions roughly halve performance, which can possibly be compensated by locating the transaction log file on a second HDD; this assumption was checked by doing one last test run with single transactions again):
- 3 MB/s without any “tricks”
- 15 MB/s with HDD preallocation
- 17 MB/s with HDD and RAM preallocation
- 24 MB/s with HDD and RAM preallocation, using 16KB page size
- 29 MB/s with HDD and RAM preallocation, using 32KB page size
- 28 MB/s with HDD and RAM preallocation, using 16KB page size, using SNAPSHOT isolation
- 27 MB/s with HDD and RAM preallocation, using 16KB page size, using setBinaryStream(int, InputStream, int) instead of setObject(int, byte[])
- 21 MB/s with HDD and RAM preallocation, using 16KB page size, using setBinaryStream(int, InputStream, int) instead of setObject(int, byte[]), using TCP/IP instead of Shared Memory
- 14 MB/s with HDD and RAM preallocation, using 16KB page size, using setBinaryStream(int, InputStream, int) instead of setObject(int, byte[]), using jConnect (TDS on localhost)
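The write test above can be sketched roughly as follows. This is a minimal reconstruction, not the vendor’s actual code: the table and column names (`docs`, `id`, `content`) are my assumptions, and the real benchmark additionally varied driver, page size and isolation level.

```java
import java.io.ByteArrayInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class BlobWriteBenchmark {

    static final int ROWS = 1024;
    static final int BLOB_SIZE = 512 * 1024; // 512 KB per row, > 512 MB total

    // Generates a deterministic dummy payload of the given size.
    static byte[] makePayload(int size) {
        byte[] payload = new byte[size];
        for (int i = 0; i < size; i++) {
            payload[i] = (byte) (i * 31 + 7);
        }
        return payload;
    }

    // Inserts all rows in one transaction; COMMIT happens once at the end,
    // so the test measures throughput rather than commit latency.
    static void insertAll(Connection con) throws Exception {
        byte[] payload = makePayload(BLOB_SIZE);
        con.setAutoCommit(false);
        try (PreparedStatement ps =
                con.prepareStatement("INSERT INTO docs (id, content) VALUES (?, ?)")) {
            for (int id = 0; id < ROWS; id++) {
                ps.setInt(1, id);
                // setBinaryStream() was the variant measured at 27 MB/s
                ps.setBinaryStream(2, new ByteArrayInputStream(payload), payload.length);
                ps.executeUpdate();
            }
        }
        con.commit();
    }
}
```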
- Read Tests
Full Table Scan (could possibly be further optimized using a clustered index when sorting is an issue, which was not enabled in the benchmark; benefits slightly from the partially loaded cache left over from the previous table-filling test)
- 55 MB/s without any “tricks” (in reality this would be worse, since the table would not be filled in one pass but fragmented over time)
- 360 MB/s with HDD preallocation.
- 16000 MB/s with HDD and RAM preallocation
- 5389 MB/s with HDD and RAM preallocation, using 32KB page size
- 10893 MB/s with HDD and RAM preallocation, using 16KB page size
- 2048 MB/s with HDD and RAM preallocation, using 16KB page size, using SNAPSHOT isolation
- 172 MB/s with HDD and RAM preallocation, using 16KB page size, using getBlob(int).getBytes(long, int) instead of getBlob(int)
- 133 MB/s with HDD and RAM preallocation, using 16KB page size, using getBinaryStream(int).read(byte[]) instead of getBlob(int)
- 50 MB/s with HDD and RAM preallocation, using 16KB page size, using getBinaryStream(int).read(byte[]) instead of getBlob(int), using TCP/IP instead of Shared Memory
- 8 MB/s with HDD and RAM preallocation, using 16KB page size, using getBinaryStream(int).read(byte[]) instead of getBlob(int), using jConnect (TDS on localhost)
- Primary Index Scattered Read (benefits heavily from the already loaded cache, which is a result of the previous full table scan test)
- 1438 MB/s without any “tricks”
- 1556 MB/s with HDD preallocation
- 1635 MB/s with HDD and RAM preallocation
- 1822 MB/s with HDD and RAM preallocation, using 32KB page size
- 1723 MB/s with HDD and RAM preallocation, using 16KB page size
- 994 MB/s with HDD and RAM preallocation, using 16KB page size, using SNAPSHOT isolation
- 157 MB/s with HDD and RAM preallocation, using 16KB page size, using getBlob(int).getBytes(long, int) instead of getBlob(int)
- 124 MB/s with HDD and RAM preallocation, using 16KB page size, using getBinaryStream(int).read(byte[]) instead of getBlob(int)
- 45 MB/s with HDD and RAM preallocation, using 16KB page size, using getBinaryStream(int).read(byte[]) instead of getBlob(int), using TCP/IP instead of Shared Memory
- 7 MB/s with HDD and RAM preallocation, using 16KB page size, using getBinaryStream(int).read(byte[]) instead of getBlob(int), using jConnect (TDS on localhost)
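The two read variants measured above differ mainly in whether the caller forces a copy of the driver’s buffer. A minimal sketch of both read loops, again with an assumed single-BLOB-column result set rather than the vendor’s real code:

```java
import java.io.InputStream;
import java.sql.Blob;
import java.sql.ResultSet;

public class BlobReadBenchmark {

    // Faster variant: getBlob() fetches only a locator; getBytes() then
    // transfers the content without an extra caller-side copy.
    static long readViaBlob(ResultSet rs) throws Exception {
        long total = 0;
        while (rs.next()) {
            Blob blob = rs.getBlob(1);
            total += blob.getBytes(1, (int) blob.length()).length;
        }
        return total;
    }

    // Slower variant: streaming copies every byte into a caller-provided buffer.
    static long readViaStream(ResultSet rs) throws Exception {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        while (rs.next()) {
            try (InputStream in = rs.getBinaryStream(1)) {
                total += drain(in, buffer);
            }
        }
        return total;
    }

    // Counts the bytes copied out of a stream, chunk by chunk.
    static long drain(InputStream in, byte[] buffer) throws Exception {
        long n = 0;
        for (int read; (read = in.read(buffer)) != -1; ) {
            n += read;
        }
        return n;
    }
}
```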
As already said, the actual numbers are of no interest; only the relative factors matter, as this is not the production server machine and we just wanted to learn what would be beneficial, not what the absolute throughput would be. Looking at these factors, it is now interesting to check our assumptions:
Preallocation of disk space
The tests clearly show that this has a tremendous effect on performance. While this was obvious for writes, it turned out to be beneficial for reads as well. Why? Because reads suffer from fragmentation, which can be prevented by preallocation. Also, reads can use temporary disk space, and we preallocated the temporary space as well. My tip: unless you are short on disk space, preallocate it. It’s really worth doing.
Preallocation of RAM
It was more or less clear that having more RAM right from the start would be beneficial, but I was astonished that the effect would be so strong. Remember, we did not compare little RAM with lots of RAM, but only whether it was preallocated. As the server will claim the RAM anyway, this seems to be a cheap trick to improve performance. So unless the RAM is needed by a different process (which I would never allow on a production server), this is possibly the easiest way to gain measurable performance.
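Both preallocation steps can be done up front. Disk space for a dbspace is added with an ALTER DBSPACE statement, while the cache is preallocated via a server start option. The sizes below are made-up examples, and the exact syntax and option names should be verified against the SQL Anywhere documentation for your version:

```java
import java.sql.Connection;
import java.sql.Statement;

public class Preallocation {

    // Grow the main and temporary dbspaces before the bulk load, so pages
    // are allocated in one contiguous step instead of being grabbed on demand.
    static final String GROW_MAIN = "ALTER DBSPACE system ADD 2 GB";
    static final String GROW_TEMP = "ALTER DBSPACE temporary ADD 1 GB";

    static void preallocate(Connection con) throws Exception {
        try (Statement stmt = con.createStatement()) {
            stmt.execute(GROW_MAIN);
            stmt.execute(GROW_TEMP);
        }
    }

    // RAM preallocation is a server start option, not SQL, e.g.:
    //   dbsrv11 -c 1600M demo.db
    // -c fixes the initial cache size so the server does not have to
    // grow the cache dynamically while the benchmark runs.
}
```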
Modified Page Size
This is hard to interpret. The problem is that across several runs, performance was once brilliant and once horrible. It seems the default is a really good all-round choice. As in the real world not all BLOBs will have the same size (512 KB was just an average value), I would not touch this option without running further intensive tests in a real production system.
SNAPSHOTs
Undoubtedly, SNAPSHOT isolation is a good way to increase concurrency (i.e. the ability of different users to access the same information at the same time), so in practice it will keep people from seeing the clock cursor too often. But for the single transaction it is a measurable drawback. So unless needed, I would stick with the default setting, which is not using SNAPSHOTs. One idea could be to break the background process down into several smaller ones, as SQL Anywhere itself does with REORGANIZE TABLE, for example. The smaller transactions effectively reduce the likelihood of making other users wait, so SNAPSHOTs may no longer be needed.
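For completeness, this is roughly how a connection is switched to SNAPSHOT isolation in SQL Anywhere, which exposes it through the isolation_level option; the option value should be verified against the documentation for your version:

```java
import java.sql.Connection;
import java.sql.Statement;

public class IsolationSetup {

    // SET TEMPORARY OPTION applies to the current connection only;
    // other connections keep the database default.
    static final String USE_SNAPSHOT =
            "SET TEMPORARY OPTION isolation_level = 'snapshot'";

    static void enableSnapshot(Connection con) throws Exception {
        try (Statement stmt = con.createStatement()) {
            stmt.execute(USE_SNAPSHOT);
        }
    }
}
```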
APIs
One of the big surprises for the average coder might be that the API has a tremendous effect on performance. This looks strange at first, but once you think about it, it becomes clear. First of all, some APIs do not force the driver to actually transmit the BLOB (like getBlob()), since a Java Blob is not the actual data but just a pointer to it (the transfer can be deferred until you access the actual bytes). Second, read(byte[]) forces the JVM to actively copy bytes from the RAM already allocated by the driver into a second piece of RAM provided by the caller, while getBytes() can simply return the driver’s piece of RAM without making a copy. So it is clear that getBlob() is the fastest API (especially when you do not want to actually transfer the bytes into the JVM but just want a reference to a BLOB to forward it back to the server in a second step, like INSERT (blob) after doing SELECT blob). If you want the bytes, you should not force the driver to make a copy. The slowest means, actively copying bytes, should be used only if you really need a stream, which unfortunately is the case with the DMS.
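The difference between the two byte-fetching calls can be demonstrated without any database, using the JDK’s own SerialBlob as a stand-in for a driver-side Blob (note that Blob.getBytes() uses a 1-based position, and that a real driver may defer the network transfer until this call):

```java
import java.io.InputStream;
import javax.sql.rowset.serial.SerialBlob;

public class CopySemantics {

    // getBytes(): one call; the driver can hand out its own buffer directly.
    static byte[] viaGetBytes(java.sql.Blob blob) throws Exception {
        return blob.getBytes(1, (int) blob.length()); // position is 1-based
    }

    // getBinaryStream().read(byte[]): every byte is copied into a
    // caller-provided buffer, chunk by chunk.
    static byte[] viaStream(java.sql.Blob blob) throws Exception {
        byte[] result = new byte[(int) blob.length()];
        try (InputStream in = blob.getBinaryStream()) {
            int off = 0;
            for (int read; off < result.length
                    && (read = in.read(result, off, result.length - off)) > 0; ) {
                off += read;
            }
        }
        return result;
    }
}
```

Both return the same bytes; the streaming path just pays for an extra copy per chunk.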
Transport
The DMS vendor told me that typically his thin server is located on the same machine as the DBMS, so he will be glad to hear that SharedMemory has a huge effect. With this sole optimization, a performance gain of more than 50% is possible. Unless a second machine is really needed, SharedMemory should be your default choice of transport.
Driver
People often think that jConnect should be faster than the iAnywhere or SQL Anywhere drivers, since it avoids copying data between the VM’s RAM and the OS’s RAM. While this might have been true in the past (I didn’t actually measure it), it is untrue with current drivers. I have not tested the iAnywhere driver, but the SQL Anywhere driver (the one introduced with 11.0.1) outperforms the jConnect driver by far. The exact same, unchanged code executes more than six times faster on the native driver. Taking into account that the native driver can do things that jConnect cannot, namely getBlob().getBytes() and SharedMemory transport, it is not just six but actually more than twenty-two times (22x!) faster than jConnect. Remember, this comes for free, as the native driver ships in the box and the DMS server mostly sits on the same machine, acting just as a data-passing protocol head. In rare cases, e.g. when you don’t actually need the BLOB’s content “on the screen”, the effect can be even bigger. Even taking into account that the Java VM’s timer is not very accurate and the micro benchmark is purely hypothetical at this point, the native driver showed in some tests a performance factor of two thousand (2000x!). While it is clear that this will never happen in the real world and is a purely synthetic effect of the micro benchmark, it does prove that the native driver can be even more than 22 times faster than jConnect, even the latest jConnect 7. And SQL Anywhere 12 (“Innsbruck”) will come with a further improved native driver, which in addition is JDBC 4 compliant.
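The claimed factors follow directly from the full-table-scan read numbers measured above: the same streaming code on the native driver over TCP/IP versus jConnect gives the “more than six times”, and the best native path (getBytes() over SharedMemory) versus jConnect gives the “more than twenty-two times”:

```java
public class DriverFactors {

    // Full-table-scan read throughput in MB/s, taken from the tests above.
    static final double NATIVE_STREAM_TCP = 50;      // same code as the jConnect run
    static final double NATIVE_GETBYTES_SHMEM = 172; // best native path
    static final double JCONNECT_STREAM_TCP = 8;

    // Identical code, only the driver swapped.
    static double sameCodeFactor() {
        return NATIVE_STREAM_TCP / JCONNECT_STREAM_TCP;       // 6.25x
    }

    // Native-only features (getBytes + SharedMemory) versus jConnect.
    static double bestCaseFactor() {
        return NATIVE_GETBYTES_SHMEM / JCONNECT_STREAM_TCP;   // 21.5x
    }
}
```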
Whether SQL Anywhere is faster or slower than another RDBMS was out of scope of this benchmark and is hard to tell unless vendors give up their restrictions on publishing benchmark results. But the tests clearly showed that, under the same conditions, a move from jConnect to the native driver is always worthwhile and will yield a measurable benefit in the range of about six to twenty-two times. Unless your client needs to run on an operating system not supported by SQL Anywhere (which will be hard to find in production environments), the native driver should be the default choice, even if this test only covered one particular performance hotspot of one particular client application.