Storing multilingual records in the MySQL database

This article was published in the April 2009 issue of php|architect.

Multilingual applications use two types of texts requiring translation. The first ones are messages without a relation to the data (for example the title "Contents") which can easily be translated by GNU gettext or by a simple database table. The second type are texts related to the data (for example a German and Italian title for the same product with a given price). This article will examine database strategies for this second type of text.

Multilingual applications should use simple SQL queries (to avoid bugs), keep the performance high (which mainly means to utilize indexes) and produce the exact output (sort according to each language collation and respect the fulltext threshold). In additionally database consistency should be maintained, for example to preserve uniqueness of URLs. This article compares three different approaches for storing multilingual records considering these goals. In addition, the complexity of adding a new language and working with untranslated texts are covered.

This article demonstrates all approaches on a simple table of products with translatable names, URLs and descriptions, group identifiers and prices (common for all languages). The application needs to filter the products by the group identifier and then sort them by the name and we will perform a fulltext search on the name and description in a specified language. Therefore, in a single language environment, the table would look like this:

CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	`group_id` int(11) NOT NULL,
	`name` varchar(50) NOT NULL,
	`url` varchar(50) NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	`description` text NOT NULL,
	UNIQUE KEY `url` (`url`),
	KEY `group_id` (`group_id`, `name`),
	FULLTEXT KEY `name` (`name`, `description`),
	PRIMARY KEY (`product_id`)
);

The unique index above URL guarantees that there can be only one product at each location. The query to retrieve the products in a given group sorted by the name is simple (PHP syntax is used):

"
	SELECT `url`, `name`, `price`
	FROM `product`
	WHERE `group_id` = $group_id
	ORDER BY `name`
	LIMIT 30
"

The usage of indexes is of course correct:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	product	ref	group_id	group_id	4	const	9	Using where

Translations in a separate table

We can store the translatable columns in a separate table:

CREATE TABLE `language` (
	`language_id` char(2) NOT NULL,
	`collation` varchar(64) NOT NULL,
	PRIMARY KEY (`language_id`)
);

CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	`group_id` int(11) NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	KEY `group_id` (`group_id`),
	PRIMARY KEY (`product_id`)
);

CREATE TABLE `product_translation` (
	`product_id` int(11) NOT NULL,
	`language_id` char(2) NOT NULL,
	`name` varchar(50) NOT NULL,
	`url` varchar(50) NOT NULL,
	`description` tinytext NOT NULL,
	UNIQUE KEY `language_id_2` (`language_id`, `url`),
	KEY `language_id` (`language_id`, `name`),
	FULLTEXT KEY `name` (`name`, `description`),
	FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`),
	FOREIGN KEY (`product_id`) REFERENCES `product` (`product_id`),
	PRIMARY KEY (`product_id`, `language_id`)
);

(The `language`.`collation` column contains one of the collations returned by the SHOW COLLATION command and we need it to properly sort the results.)

With this schema, a simple query to retrieve the products in a given group sorted by the name is not so simple:

"
	SELECT `product_translation`.`url`, `product_translation`.`name`, `product`.`price`
	FROM `product`
	INNER JOIN `product_translation` USING (`product_id`)
	WHERE `product`.`group_id` = $group_id AND `product_translation`.`language_id` = '$language_id'
	ORDER BY `product_translation`.`name` COLLATE $collation
	LIMIT 30
"

(Variable $collation is taken from `language`.`collation` in the application initialization.)

The query is complex and its execution plan is bad:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	product	ref	PRIMARY, group_id	group_id	4	const	12	Using temporary; Using filesort
1	SIMPLE	product_translation	eq_ref	PRIMARY, language_id_2, language_id	PRIMARY	10	product.product_id, const	1	Using where

Because of filesort, MySQL has to read all entries and cannot use the LIMIT clause to skip unnecessary rows. It is N * log N where as to perform the ordering with LIMIT clause would be a constant! MySQL cannost use the index on (language_id, name) for sorting at all because we need a different collation for every language.

What about fulltext search? MySQL natural language search ignores words occurring in at least half of the rows. Because all languages are mixed in one column, this feature will likely exclude most words and thus be useless. Fulltext search can be good in some scenarios but it has negative impact on performance. Moreover, MySQL cannot use a fulltext index together with a normal index so the fulltext must search in all languages and then restrict the results by hand.

In a real-world application, we have to deal with untranslated texts. Usually, we use some default language in the place of them. With this table schema, we have two options available to solve this problem:

One option is to copy the default language version of all texts to the newly added language. This greatly complicates adding a new language to the application. It also complicates adding a new product. Unique index above URL would work but then we would have to insert all translations in a transaction and rollback the transaction in a case of conflict.

The second option is to use the default language version if a translation does not exist. This makes Adding a language very simple (just inserting a row in the language table) but the so called "simple query" gets complicated even more

"
	SELECT IFNULL(`product_translation`.`url`, `product_default`.`url`) AS `url`, IFNULL(`product_translation`.`name`, `product_default`.`name`) AS `name`, `product`.`price`
	FROM `product`
	INNER JOIN `product_translation` `product_default` USING (`product_id`)
	LEFT JOIN `product_translation` ON `product`.`product_id` = `product_translation`.`product_id` AND `product_translation`.`language_id` = '$language_id'
	WHERE `product`.`group_id` = $group_id AND `product_default`.`language_id` = '$default_id'
	ORDER BY IFNULL(`product_translation`.`name`, `product_default`.`name`) COLLATE $collation
	LIMIT 30
"

Unique index above URL is useless as we would have to check for existence of the same URL in a trigger or a stored procedure at the time of inserting, updating and deleting a product translation.

Data copy

This approach copies also the data common to all languages.

CREATE TABLE `language` (
	`language_id` char(2) NOT NULL,
	`collation` varchar(64) NOT NULL,
	PRIMARY KEY (`language_id`)
);

CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	PRIMARY KEY (`product_id`)
);

CREATE TABLE `product_translation` (
	`product_id` int(11) NOT NULL,
	`language_id` char(2) NOT NULL,
	`group_id` int(11) NOT NULL,
	`name` varchar(50) NOT NULL,
	`url` varchar(50) NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	`description` text NOT NULL,
	UNIQUE KEY `language_id_2` (`language_id`, `url`),
	KEY `language_id` (`language_id`, `group_id`, `name`),
	FULLTEXT KEY `name` (`name`, `description`),
	FOREIGN KEY (`product_id`) REFERENCES `product` (`product_id`),
	FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`),
	PRIMARY KEY (`product_id`, `language_id`)
);

(The product table can be just a sequence.)

The group identifier and the price are copied to all languages. Technically a trigger can achieve it, however we have to define a trigger on all translatable tables in the database and it has to respect all columns in the table (so addition or modification of a table would be difficult). The data also takes more space so the working set is less likely to fit in memory.

The simple query remains simple:

"
	SELECT `url`, `name`, `price`
	FROM `product_translation`
	WHERE `language_id` = '$language_id' AND `group_id` = $group_id
	ORDER BY `name` COLLATE $collation
	LIMIT 30
"

Even the execution plan would be good only if it were not for the COLLATE clause:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	product_translation	ref	language_id_2, language_id	language_id	10	const, const	9	Using where; Using filesort

The COLLATE is causing a filesort which is bad for the LIMIT clause. The query would be fast without it but the results would not respect the sorting order specific for the language.

Both problems with fulltext search remain.

We have to solve the same dilemma as in the previous approach regarding untranslated texts and adding a new language. Either the queries will be complicated and slower or the language addition will be painful. Unique index would not work in the case of falling back to the default language or we would have to do all modifications in a transaction when copying the default language.

Translation directly in the data table

The previous approaches have one basic problem - they are storing different data (texts with miscellaneous collations) to the same column. We have to separate the languages into the individual columns to solve this basic problem:

CREATE TABLE `language` (
	`language_id` char(2) NOT NULL,
	`collation` varchar(64) NOT NULL,
	PRIMARY KEY (`language_id`)
);

CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	`group_id` int(11) NOT NULL,
	`name_en` varchar(50) NOT NULL,
	`name_cs` varchar(50) collate utf8_czech_ci NOT NULL,
	`url_en` varchar(50) NOT NULL,
	`url_cs` varchar(50) collate utf8_czech_ci NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	`description_en` text NOT NULL,
	`description_cs` text collate utf8_czech_ci NOT NULL,
	UNIQUE KEY `url_en` (`url_en`),
	UNIQUE KEY `url_cs` (`url_cs`),
	KEY `group_id_en` (`group_id`, `name_en`),
	KEY `group_id_cs` (`group_id`, `name_cs`),
	FULLTEXT KEY `name_cs` (`name_cs`, `description_cs`),
	FULLTEXT KEY `name_en` (`name_en`, `description_en`),
	PRIMARY KEY (`product_id`)
);

(Notice the different collations for columns with translations.)

Each translatable field gets its own column in each language. The same holds for indexes - there are more of them but they are smaller.

The table is not nice but the simple query remains simple and produces correct results:

"
	SELECT `url_$language_id`, `name_$language_id`, `price`
	FROM `product`
	WHERE `group_id` = $group_id
	ORDER BY `name_$language_id`
	LIMIT 30
"

(The $language_id variable is whitelisted from `language`.`language_id`.)

It is also as fast as a single-language query:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	product	ref	group_id_en, group_id_cs	group_id_en	4	const	11	Using where

Both problems with fulltext search disappear. Each index has separate statistics and there is no need to filter by language identifier when performing the search.

Adding a language definitely will be a difficult operation so we do not need to solve the dilemma whether to store the default language to untranslated texts. It can be stored to keep query simplicity and speed. Unique indexes above URLs simply work.

Adding a new language will be a complex operation when this approach is used. We have to go through all tables and add a column for each language field. We also have to create new indexes. We can achieve it by a stored procedure or by a PHP function:

<?php
/** Addition of language columns to all tables
* @param string language identifier
* @param string default language identifier
* @param string way of ordering texts
* @return null execution of commands ALTER TABLE and UPDATE
* @copyright Jakub Vrana, https://php.vrana.cz/
*/
function add_language($language_id, $default_id = "en", $collation = "") {
    $result = mysql_query("SHOW TABLES");
    while ($row = mysql_fetch_row($result)) {
        
        // addition and copy of columns
        $alter = array();
        $update = array();
        $result1 = mysql_query("SHOW COLUMNS FROM $row[0]");
        while ($row1 = mysql_fetch_assoc($result1)) {
            if (substr($row1["Field"], -3) == "_$default_id") {
                $field = substr($row1["Field"], 0, -2) . $language_id;
                $null = ($row1["Null"] == "NO" ? " NOT NULL" : "");
                $alter[] = "ADD $field $row1[Type]" . ($collation ? " COLLATE $collation" : "") . "$null AFTER $row1[Field]";
                $update[] = "$field = $row1[Field]";
            }
        }
        mysql_free_result($result1);
        if ($alter) {
            mysql_query("ALTER TABLE $row[0] " . implode(", ", $alter));
            mysql_query("UPDATE $row[0] SET " . implode(", ", $update));
            
            // addition of indexes
            $indexes = array();
            $result1 = mysql_query("SHOW INDEXES FROM $row[0]");
            while ($row1 = mysql_fetch_assoc($result1)) {
                $type = ($row1["Index_type"] == "FULLTEXT" ? "FULLTEXT" : ($row1["Non_unique"] ? "INDEX" : "UNIQUE"));
                $indexes[$type][$row1["Key_name"]] .= "$row1[Column_name], ";
            }
            mysql_free_result($result1);
            foreach ($indexes as $type => $type_indexes) {
                foreach ($type_indexes as $index) {
                    if (strpos($index, "_$default_id, ")) {
                        mysql_query("ALTER TABLE $row[0] ADD $type (" . substr(str_replace("_$default_id, ", "_$language_id, ", $index), 0, -2) . ")");
                    }
                }
            }
        }
    }
    mysql_free_result($result);
}
?>

The maximum row size in MySQL is about 64 KB so with big tables and many languages, this limit could be reached. In this case, it is possible to separate some columns to a different table with one-to-one relationship. Another approach would be to store each language in a separate table and copy common columns the same way as we did in the Data copy approach. This however involves problems with trigger maintenance.

Other databases

Other database systems such as Oracle or PostgreSQL do not have support for a collation definition in each column and use a different approach for a language-dependent sorting. They define (PostgreSQL through an extension) an nlssort function accepting a string to sort and a language identifier to sort by. The database systems can use this function together with partial function indexes to sort the products effectively in the Data copy schema. However, we still have to define the indexes at the time of adding a language.

Final thoughts

For messages with no relation to the data (mentioned in the beginning of this article) we can use the same database schema to store translations. This way we can use the same infrastructure for translating both kinds of texts.
Translators need a list of texts required for translation. We cannot generate this list by comparing messages that are the same as the default language because some translations are simply the same as original. We can maintain this list in a separate table.
In the ideal world, all the translations would be stored in a single cell with complex data type. This data type would know how to sort the data in each language and how to use them in index. It would also know how to handle untranslated texts. However, we have to split the cell to either rows or columns in the world of relational databases.

Conclusion

Let's name pros and cons of all three approaches:

Translations in a separate table	Data copy	Translation in the data table
- slow sorting - slow fulltext search - complex SQL queries Copy of default language: - complex language addition - complex product insertion - extra table for untranslated texts Fallback to default language: + simple language addition - even more complex SQL queries - complex URL modifications	- slow or inaccurate sorting - slow fulltext search - bigger footprint - difficult table altering Copy of default language: + simple queries - complex language addition - complex product insertion - extra table for untranslated texts Fallback to default language: + simple language addition - complex SQL queries - complex any URL modification	- variable database structure - very complex language addition Copy of default language: + simple queries + fast sorting + fast fulltext search - extra table for untranslated texts

The third approach has a big handicap in the variable database structure (the user adding a language must have the ALTER privilege). It exceeds the other approaches in all other aspects and at the end of a day, we have to do the least work and still get a solution with the highest performance.

About the author

Jakub Vrána is one of the authors of official PHP manual, and he uses MySQL for developing web applications in PHP. He also teaches the MySQL basics at Charles University and conducts commercial trainings. He is the founder of compact MySQL management tool ~~phpMinAdmin~~ Adminer. Jakub lives in Prague, Czech Republic.

Jakub Vrána, Výuka

Comments

Patrik Votoček (Vrtak-CZ):

Bude česká verze?

24.8.2009 02:40:48

kluvi:

http://php.vrana.cz/ukladani-vicejazycnych-zaznamu.php
:-)nevim jestli je to uplne presne to stejny.. ale obsah je +/- stejnej :-)

24.8.2009 07:32:38

Jakub Vrána :

Ano, to je jednodušší článek na stejné téma. Kompletní překlad tohoto článku psát nebudu.

24.8.2009 11:37:35

Marek Hrabě:

Prosím: http://translate.google.com/translate?prev=…&history_state0=

Je to taková pseudo-čeština :D

Jinak koukni ta, kam psal kluvi...(#d-9028)
#d-9033 reply

Marek Hrabě:
Moc pěkný článek, Jakube! :) Jen by to možná chtělo zaměnit phpMinAdmin za Adminer...

(reply to) 24.8.2009 09:56:11

Jakub Vrána :

Díky za upozornění. Ale jde o přetisk článku z php|architecta, takže v něm výjimečně změny dělat nebudu.

(reply to) 24.8.2009 11:38:20

Mastodont:

Chybí mi tu čtvrtá možnost - samostatná tabulka pro každý jazyk překladu. Nevýhodou je sice proměnný počet tabulek, ale zase by se mělo značně vylepšit řazení.

24.8.2009 14:28:14

Jakub Vrána :

Ano, je to kompromis mezi druhým a třetím přístupem. Odstraňuje problémy

- slow or inaccurate sorting
- slow fulltext search

a naopak přidává

- variable database structure
- complex language addition (i ve variantě Fallback to default language)

24.8.2009 14:37:43

Prdlořeznictví Krkovička, n. p.:

In additionally - buď Additionally, nebo In addition,
MySQL cannost use the index - cannot nebo can not
+pár menších.
Dobrý článek. Díky, že ses o něj podělil.

24.8.2009 22:05:25

Jakub Vrána :

Korektury si v php|architectu naštěstí dělali sami, článek byl celý červený :-). Sem jsem ale asi dal verzi bez korektur, protože se mi je nechtělo přenášet z jejich formátu do HTML.

25.8.2009 10:43:27

Ludmila Pičmanová:

Ten limit 64 KB na řádku, ten platí pro součet všech sloupců kromě typů text a blob?

25.8.2009 17:03:32

Diskuse je zrušena z důvodu spamu.

Navigace

PHP triky

Kupte si mou knihu

Reklama

Hledáte-li programátora nebo naopak sami programujete a nemáte do čeho píchnout, využijte služeb portálu nezávislých profesionálů Na volné noze.

Web běží na serveru

Váš hosting

Články podle skupin

Výběr článků

Nejnovější články

Normy a manuály

Další projekty

. Můžeme si tykat. Skripty předpokládají nastavení: magic_quotes_gpc=Off, magic_quotes_runtime=Off, error_reporting=E_ALL & ~E_NOTICE a očekávají předchozí zavolání mysql_set_charset. Skripty by měly být funkční v PHP >= 4.3 a PHP >= 5.0.

Storing multilingual records in the MySQL database

Translations in a separate table

Data copy

Translation directly in the data table

Final thoughts

Conclusion

About the author

Comments

Patrik Votoček (Vrtak-CZ):

kluvi:

Jakub Vrána :

Marek Hrabě:

Jakub Vrána :

Mastodont:

Jakub Vrána :

Prdlořeznictví Krkovička, n. p.:

Jakub Vrána :

Ludmila Pičmanová:

Jakub Vrána :

Hari KT:

Jakub Vrána :

Navigace

PHP triky

Kupte si mou knihu

Reklama

Web běží na serveru

Články podle skupin

Výběr článků

Nejnovější články

Normy a manuály

Další projekty