Thursday, 28 May 2015

3 Must-Read Books About Database Performance

SQL Performance Explained: Everything Developers Need to Know about SQL Performance

An in-depth book on how to improve database performance. The focus is on relational databases and it covers all major SQL databases without getting lost in the details of any one specific product. Starting with the basics of indexing and the WHERE clause, SQL Performance Explained guides developers through all parts of an SQL statement and explains the pitfalls of object-relational mapping (ORM) tools like Hibernate.

PostgreSQL 9.0 High Performance

An excellent book for intermediate to advanced PostgreSQL database administrators (DBAs). It teaches everything about building, monitoring and maintaining a PostgreSQL installation while also providing useful and interesting information about database internals. If you truly want to understand how PostgreSQL operates and behaves under high load, this is the book for you.

High Performance MySQL: Optimization, Backups, and Replication

Advanced techniques for everything from designing schemas, indexes, and queries to tuning your MySQL server, operating system, and hardware to their fullest potential. This guide also teaches you safe and practical ways to scale applications through replication, load balancing, high availability, and failover.

Wednesday, 15 October 2014

DatabasePack Beta: Database-as-a-Service for testing

DatabasePack is a new service that provides access to all major versions of MySQL, PostgreSQL and SQL Server. The service makes testing code on multiple database versions easier because they are already hosted and just a few clicks away. Usually you would need to set up all databases and their different versions manually, and if you have tried that you probably know how frustrating it can be. Hosting each database version on a separate server could also get quite expensive and would be a maintenance hell. DatabasePack is powered by Docker, an open platform for developers and sysadmins to build, ship, and run distributed applications.

Screenshot of DatabasePack Web Interface

Apart from offering databases as a service, they plan to launch a feature for test data generation. It would help to create extensive test data automatically, but DatabasePack also supports importing CSV, SQL or JSON files.

The service is currently in beta, but you can request an invite on their landing page.

Monday, 10 March 2014

Data Visualization Mistakes

Data visualization simply means the creation and analysis of data represented visually. Data abstraction and representation is done in various forms with several attributes and variables serving as units of information. Data visualization is considered the best way to understand any given data. However, there are some mistakes that people need to avoid in order to more efficiently understand data.

1. Error in Chart Percentages

It’s common to employ pie charts and tables for the purpose of visual representation. While including charts in data representation is not difficult, more often than not users can trip and make mistakes. One of the most common mistakes with pie charts is to divide them into percentages that simply do not add up. The basic rule of a pie chart is that the sum of all percentages included should be 100%. Consider this example:

The percentages not only fall short of 100%, the segment sizes also do not match their values. This can happen because of various reasons such as rounding error or a miscalculated percentage. This can also happen when non-mutually exclusive categories are plotted on the same chart. Unless the included categories are mutually exclusive, their percentage cannot be plotted separately using the same chart. For such categories, it is better to make use of separate charts.
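The rounding problem described above can be caught before a chart is drawn. The following is a minimal sketch, not a prescribed method: the function name `to_percentages` and the category labels are made up for illustration. It uses the largest-remainder approach, which rounds every share down and then hands the leftover points to the categories with the largest fractional parts, so the whole-number percentages always add up to 100.

```python
from math import floor


def to_percentages(counts):
    """Convert raw counts to whole-number percentages that sum to exactly 100.

    `counts` maps a category label to a non-negative count. The categories
    are assumed to be mutually exclusive.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")

    # Exact (fractional) share of each category.
    exact = {label: value * 100 / total for label, value in counts.items()}
    # Round everything down first; this can only undershoot 100.
    rounded = {label: floor(share) for label, share in exact.items()}
    shortfall = 100 - sum(rounded.values())

    # Give the missing points to the largest fractional remainders.
    by_remainder = sorted(
        exact, key=lambda label: exact[label] - rounded[label], reverse=True
    )
    for label in by_remainder[:shortfall]:
        rounded[label] += 1
    return rounded


# Hypothetical data: three equally sized categories.
counts = {"A": 1, "B": 1, "C": 1}

# Naive rounding gives 33 + 33 + 33 and falls short of 100.
naive = {label: round(value * 100 / 3) for label, value in counts.items()}
fixed = to_percentages(counts)
```

A simple `sum(values) == 100` check on the final labels is also a cheap guard against plotting categories that are not mutually exclusive, since their shares will typically add up to more than 100.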

2. Ambiguous Representation of Data

Ambiguity is not always intentional and can creep into data visualization quite often. It is important to plot accurate data, but it is equally important to avoid overly exotic graphs. Such graphs usually divert the reader's attention from the actual data. Use the attributes of color, brightness and saturation only where they are needed. Efficient use of labels and other marks also helps to clarify different aspects of the data.

Example of Ambiguous Representation of Data

The chart represented above should have less saturation and brightness to make it more comprehensible and clear.

3. Displaying Too Much Data

People are usually looking for specific information when they are scanning through a data visualization document. So it is very important that only relevant, specific and concrete data is represented while leaving out anything that is irrelevant. Presence of irrelevant data, whether in the form of tables or charts, makes finding the required information difficult. This is also related to the cluttering in graphs. It is always better to use several graphs to represent related quantities than putting them all into a single graph and cluttering it.

A few simple and easy to read graphs are always better than one complicated and cluttered representation of data. Similarly, the choice between a bar chart and a pie chart can also affect the clarity of the representation.

The data is clearly congested and it would have been better to represent it in the form of a bar chart. A bar chart would also allow comparison between different units.

4. Consistency of Data Visualization

One of the most common mistakes in data visualization is to represent data using many different kinds of visualizations. Good practice is to pick one visualization technique and stick with it to the end. When different techniques are mixed, a reader needs to comprehend each part differently before moving on to the next one, and information can get lost along the way. To help the audience understand the information more efficiently, keep the visualizations consistent.

5. Keep It Simple

The most important lesson for data visualization is, just like everything else, not to let go of simplicity. It’s natural to feel that a more embellished or artistic representation would result in more clarity but, more often than not, it does not. Besides, this practice also distracts people from the actual data. For instance, look at the example below.

There are several ambiguous things in the chart. It is not clear why the first image is blue and the rest are red. Further, the number in the second image is against the paintbrush and not against the head while in all other columns it is against the head. But a user might just appreciate different figures and think about the real-life characters represented by them and move on without understanding the data. The importance of visual representation of data has increased with the advent of mobile technology because of easy access to the internet. However, it is very important that visual representation of data is free of the pitfalls that make data representation ambiguous and irrelevant.

Saturday, 1 March 2014

Favorite Relational Database

Last week we conducted a small survey in the form of a poll: "What is your favorite RDBMS?". The poll received more responses than anticipated: a little over 900 votes were submitted. The topic in question can induce flame wars, but this time it passed without major incidents. There were several attempts to vote for MongoDB, which is not a relational database management system, so we had to remove it from the results. MongoDB is often a contentious subject in the context of relational databases. To put it in the sarcastic words of redditor day_cq:

"mongodb all the way. full web scale power. i think mongodb should be written in nodejs to take advantage of event driven nature of fast web scale loop. and gruntjs and lessc and uglifyjs adds full power to angulrjs real time web scale RDBMS."

If you don't see a chart below, please view it here: Favorite relational database management system. Full results are shared publicly and we would like to invite you to create your own representation of the data.

Thank you all for participating!

Update: Tyson Hollas has submitted a better representation of the data. See the chart below.

Thursday, 27 February 2014

History of MySQL

MySQL is one of the most widely used open source relational database management systems in the world. With more than 100 million copies distributed worldwide, the software has become the first choice of large data management corporations spanning a wide range of internet technologies.

Inception

MySQL was created by a Swedish company, MySQL AB, in 1995. The developers of the platform were Michael Widenius (Monty), David Axmark and Allan Larsson. The foremost purpose was to provide efficient and reliable data management options to home and professional users. Over half a dozen alpha and beta versions of the platform were released by 2000. These versions were compatible with almost all the major platforms.

Open Source Status

Originally the property of MySQL AB, the platform went open source in 2000 and began following the terms of GPL. Going open source resulted in a significant drop in revenues which were, however, recovered eventually. The open source nature of MySQL has made it open for the contributions of third party developers.

Expansion in Business

MySQL gained steady popularity among home and professional users and by 2001, the platform had 2 million active installations. In 2002, the company expanded its reach and opened US headquarters in addition to its Swedish headquarters. The same year, it was announced that the platform had exceeded 3 million users, with revenue amounting to $6.5 million.

First Lawsuit

MySQL AB also faced its first major lawsuit in June 2001 when it was sued by NuSphere in US District Court in Boston. The charges included violation of third party contracts and unfair competition. In response, MySQL AB sued NuSphere in 2002 for copyright and trademark infringement. Both companies reached a settlement after a preliminary hearing on 27 February 2002.

Shift in Strategy

The platform continued to gain popularity and by the end of 2003, it could boast total revenue of $12 million with 4 million active installations. In 2004, the company decided to focus more on recurring end user revenue instead of one-time licensing fee. The strategy proved to be profitable and the year ended with net revenue of $20 million.

Oracle’s Acquisition of Innobase

In 2005, Oracle purchased Innobase, the company which managed MySQL’s InnoDB storage backend. This storage engine allows MySQL to implement important functions such as transactions and foreign keys. The same year, MySQL Network, developed along the lines of Red Hat Network, was launched. This resulted in MySQL 5, which significantly expanded the feature set available to enterprise users. The following year, the contract between MySQL and Innobase was renewed.

Further Acquisitions by Oracle

In 2006, Oracle also purchased Sleepycat, the company that manages the Berkeley DB transactional storage engine of MySQL. However, this acquisition did not have any major effect because Berkeley DB was not widely used and was not included in the versions of MySQL launched in October 2006. Meanwhile, the popularity of the company continued to increase with 8 million active installations in 2006. By the same year, MySQL had 320 employees in 25 countries. The distinguishing feature of MySQL employees was that 70% of them worked from home, thanks to the open source nature of the platform. Revenues of the company reached $50 million by the end of 2006 and by the end of the following year total revenues were $75 million.

MySQL Acquired by Sun Microsystems

In January 2008, MySQL was acquired by Sun Microsystems for $1 billion. The decision was criticized by Michael Widenius and David Axmark, the co-founders of MySQL AB. At that time MySQL was already the first choice of large corporations, banks and telecommunications companies. The CEO of Sun Microsystems, Jonathan Schwartz, called MySQL “the root stock” of the web economy.

Oracle’s Acquisition of Sun and MySQL

Sun’s acquisition of MySQL did not prove very fruitful and in April 2009, an agreement was reached between Sun Microsystems and Oracle Corporation according to which Oracle was to purchase Sun Microsystems along with the MySQL copyrights and trademark. The deal was approved by the U.S. government on 20 August 2009. As a result of an online petition started by Monty Widenius, one of the founders of MySQL, Oracle faced a few legal complications with the European Commission. However, the problems were resolved and in January 2010 the acquisition of MySQL by Oracle became official.

MySQL Forks

Michael Widenius left Sun Microsystems after it was acquired by Oracle and eventually developed a fork of MySQL called MariaDB. Forks are related projects that can be considered mini-versions of standard MySQL. To date, several such versions have been launched which aim at providing specific functionality. MariaDB is a community-owned fork, which means that it does not carry the usual MySQL license restrictions that the standard version has. It is compatible with the MySQL binary library, so there isn’t any difference between the commands and the APIs.

Drizzle is another fork that is mainly developed for cloud computing markets. Thus features that are not required for cloud computing are removed from the standard version to make it faster and lighter. Drizzle was originally developed by Brian Aker in 2008. The first GA version of Drizzle was launched in March 2011.

Percona Server is a MySQL fork that incorporates the XtraDB storage engine. It provides various new features for data analysis and management. A few forks have also been discontinued over the years. Among these is OurDelta, which was a combination of various patches.

A lot of work performed by forks is centered on the InnoDB storage engine. This is true for main patches included in Percona, OurDelta, XtraDB and MariaDB's replacement engine Maria. Forks are just the diverse implementations of standard MySQL aiming at specific functions.

MySQL and Cloud Computing

The older versions of MySQL were only developed for the conventional machines. However, with the advent of Cloud Computing, MySQL was also made compatible with various cloud computing services such as Amazon EC2. Various deployment models have been used for the implementation of MySQL on cloud computing platforms. Perhaps the most popular of these models is ‘Virtual Machine Image’, which allows the use of a ready-made machine image where MySQL is pre-installed.

A second cloud computing model is managed MySQL cloud hosting, where the database is not available as a service but is hosted and managed on behalf of the owner. This model, however, is offered only by a handful of companies. With the expansion in cloud computing and related technology, MySQL versions for cloud computing are also expected to increase in number.

Monday, 24 February 2014

Poll: What is Your RDBMS of Choice?

We have created a new poll to learn which relational database management system is your favorite.

Poll: What is Your RDBMS of Choice?

Voting ends on Friday and the results will be published soon after that on the Database Friends blog. Stay tuned!

Monday, 17 February 2014

TokuMX – High-Performance MongoDB Distribution

TokuMX is a high-performance version of MongoDB. It is an open-source project by Tokutek, a database company that focuses on Big Data solutions. TokuMX is more performant than MongoDB and has greater operational efficiency. Tokutek claims on their homepage:

"TokuMX is a drop-in replacement for MongoDB, and offers 20X performance improvements, 90% reduction in database size, and support for ACID transactions with MVCC."

TokuMX introduces patented Fractal Tree indexing technology (Technology Overview in PDF), which improves upon MongoDB's default B-tree indexing. Fractal Tree indexing is based on a cache-oblivious algorithm and implements the same operations as a B-tree. Fractal Tree indexes use large, less frequent writes instead of small and rapid ones, which benefits compression and insertion performance.
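To make the "large, less frequent writes" idea concrete, here is a toy sketch of write buffering. It is not Tokutek's Fractal Tree implementation and does not reflect its actual data structures: the class `BufferedStore`, its `capacity` parameter and the in-memory `storage` list standing in for disk are all assumptions made purely for illustration. Inserts are collected in a small buffer and merged into the main storage in one batch, so ten inserts cost two large writes rather than ten small ones.

```python
class BufferedStore:
    """Toy key store that batches inserts before writing them out.

    Illustrative only: `storage` stands in for on-disk data and
    `flushes` counts how many batched "disk writes" were performed.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity  # inserts collected before one batch write
        self.buffer = []          # pending inserts, held in memory
        self.storage = []         # sorted keys, stands in for disk
        self.flushes = 0

    def insert(self, key):
        # An insert only touches the in-memory buffer...
        self.buffer.append(key)
        # ...until the buffer is full, then everything is written at once.
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One large write: merge the whole batch into the sorted storage.
        self.storage = sorted(self.storage + self.buffer)
        self.buffer.clear()
        self.flushes += 1

    def contains(self, key):
        # Lookups must check pending inserts as well as flushed data.
        return key in self.buffer or key in self.storage


store = BufferedStore(capacity=4)
for key in range(10):
    store.insert(key)
```

After the loop, eight keys have been written in two batches and two keys are still waiting in the buffer. The trade-off in this sketch is that reads have to consult the buffer too, which is why `contains` checks both places.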

Key features overview:

Recently Tokutek released TokuMX version 1.4, which brings improvements to MongoDB sharding and replication, among other useful features. For more information on this release, please check out the full release notes.

The TokuMX open-source engine can be found on GitHub as Tokutek/mongo (GNU General Public License). There is also an Enterprise version with a proprietary EULA. To obtain the Enterprise version, a subscription must be purchased. It gives customers the following benefits: Technical Support, Onboarding Call and access to Advanced Tools (e.g., Hot Backup).

External links
Product homepage - TokuMX
TokuMX is MongoDB on steroids (MySQL Performance Blog)
Wikipedia Article

Tuesday, 28 January 2014

Daily Database Links

Below you can find a daily list of interesting database related articles.

Xkcd: Permanence

Thursday, 18 July 2013

Percona Toolkit 2.1.10 is now available

Percona has announced the release of Percona Toolkit 2.1.10. Percona Toolkit is a fork of the Maatkit database utilities created by former Percona employee Baron Schwartz.

Percona Toolkit logo

This release of Percona Toolkit includes several bug fixes. Some of the most important ones:

Improved pattern matching to fix a pt-deadlock-logger error that occurred when a different timestamp format was used
Fix for the new --utc option for pt-heartbeat
Fix for pt-table-checksum using the first non-unique index instead of the one with the highest cardinality

Neo4j 1.9.2 now available

The latest version of the 1.9 series of Neo4j, version 1.9.2, was released today. While there aren't any new API-level features in this release, there are plenty of goodies under the covers.

Neo4j logo

Improvements:

Reduced the amount of IO that Neo4j performs in some cases when running on Windows.
Many tweaks and bug fixes that improve the overall experience of running and querying Neo4j
Some fixes to the REST API to ensure proper use of the Content-Encoding header

Full release announcement | Neo4j Blog
