]
The most fiddly bit I had to deal with was actually optional and involved
converting some of the ad-hoc character combinations that I'd added to my markup
language in order to use characters not supported in the latin1 encoding (e.g. I
would write Slavoj Žižek as "Slavoj Zizek"). I used sed to replace these
character combinations. Obviously, no-one else uses these, but for the sake of
posterity -- and to serve as an example for other replacements -- here they are:
sed \
-e 's@(a_)@ā@g' \
-e 's@c@ç@g' \
-e 's@C@Ç@g' \
-e 's@(c-)@č@g' \
-e 's@(C-)@Č@g' \
-e 's@g@ğ@g' \
-e 's@i@ı@g' \
-e 's@I@İ@g' \
-e 's@(l-)@ł@g' \
-e 's@(L-)@Ł@g' \
-e 's@s@ş@g' \
-e 's@S@Ş@g' \
-e 's@(s-)@š@g' \
-e 's@(S-)@Š@g' \
-e 's@(u-)@ū@g' \
-e 's@z@ž@g' \
-e 's@Z@Ž@g' \
-e "s@(a\\\')@á@g" \
-e "s@(A\\\')@Á@g" \
-e "s@(C\\\')@Ć@g" \
-e "s@(n\\\')@ń@g" \
-e "s@(N\\\')@Ń@g" \
-e "s@(o\\\')@ő@g" \
-e "s@(O\\\')@Ő@g" \
earthli.sql > earthli_utf8.sql
I used @ as the separator character and had to escape the backslash twice (once
for sed and once for bash). Also, you have to use a different output file
because the shell truncates the output file before sed even starts reading. If
you use the same file, then you just end up with an empty file. Neat.
It's not super-efficient, but it was done in a few seconds.
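As a minimal sketch of the pattern above, the following runs two of the replacements on a throwaway sample file and writes the result to a second file. The file names and the sample text are made up for illustration, and it assumes a sed that passes multi-byte replacement text through unchanged.

```shell
# Hypothetical sample input using two of the ad-hoc combinations.
workdir=$(mktemp -d)
printf '%s\n' '(L-)ukasz and Karel (C-)apek' > "$workdir/sample.txt"

# Read from one file and write to another: with "> sample.txt" the shell
# would empty the input before sed ever opened it.
sed \
  -e 's@(L-)@Ł@g' \
  -e 's@(C-)@Č@g' \
  "$workdir/sample.txt" > "$workdir/sample_utf8.txt"

result=$(cat "$workdir/sample_utf8.txt")
echo "$result"

# Clean up the throwaway files.
rm -rf "$workdir"
```

The parentheses and the hyphen need no escaping here because sed uses basic regular expressions by default, where they are ordinary characters.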
A bonus to doing these replacements for me is that a full-text search for
"Zizek" or "Žižek" now finds all articles where I mention the Slovenian
philosopher. That didn't work before because MySQL was indexing "Zizek" instead.
[Working with the dump file]
If you need to open the dump file, be aware that the lines are very long. vim
does a good job of searching and editing and jumping to locations (e.g. +normal
15G25| jumps to line 15, column 25). nano can also find text (Ctrl + W) pretty
well and quickly. Both edit the text without a problem, once you've found the
location you're interested in.
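If you only need to look at one spot, you can also pull a slice out of a long line without opening an editor at all. This is a rough sketch, assuming you already know the line and column (15 and 25, as in the vim example). It uses a generated stand-in file rather than the real dump, and the variable names are just for illustration.

```shell
# Build a stand-in file: 20 lines of roughly 5000 characters each.
dump=$(mktemp)
awk 'BEGIN {
  for (i = 1; i <= 20; i++) {
    s = "row" i ":"
    for (j = 0; j < 500; j++) s = s "abcdefghij"
    print s
  }
}' > "$dump"

line=15   # line to inspect
col=25    # first column of the slice
width=10  # number of characters to show

# Print only the requested window of the requested line, then stop reading.
slice=$(awk -v l="$line" -v c="$col" -v w="$width" \
  'NR == l { print substr($0, c, w); exit }' "$dump")
echo "$slice"

rm -f "$dump"
```

Whether substr counts characters or bytes depends on the awk implementation and the locale, so the column may be off by a little on lines with many non-ASCII characters.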
Desktop editors (e.g. Visual Studio Code or Sublime Text) and differs (e.g.
BeyondCompare) were mostly overwhelmed by both the file size and the line
lengths.
Luckily, I only ended up needing to make one edit to avoid an error creating an
index because the UTF-8 collation considered "bugin" and "bügin" to be
equivalent.
[Commands]
I made most of the following changes from the command line, but made one change
using PHPMyAdmin.
Here's what I ended up doing:
1. Dump the current database. MySQL dumps to UTF-8 by default and converts
all text.
mysqldump --user=earthli -p --add-drop-table earthli > earthli.sql
2. Verify that the dump file is in UTF-8 format. If it's not, then you can use
iconv to change the encoding (example from Wikipedia):
iconv -f iso-8859-1 -t utf-8 -o
3. Search/replace the character set for each table with the following
command:
sed -e 's@CHARSET=latin1@CHARSET=utf8mb4@g' earthli.sql > earthli_utf8.sql
4. Use PHPMyAdmin to change the default encoding for the database to utf8mb4
in the Operations pane for the database.
5. Import the database.
cat earthli_utf8.sql | mysql --user=earthli -p earthli
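Before importing, a couple of quick checks can catch problems early. This is only a sketch on a tiny stand-in file (the table definition and file names are invented). iconv exits with an error if the file is not valid UTF-8, and grep counts how many latin1 table declarations are left after the replacement.

```shell
work=$(mktemp -d)

# Stand-in for the real dump; the table is made up.
printf '%s\n' \
  'CREATE TABLE articles (title varchar(255)) ENGINE=InnoDB DEFAULT CHARSET=latin1;' \
  'INSERT INTO articles VALUES ("Slavoj Žižek");' > "$work/dump.sql"

# 1. Converting UTF-8 to UTF-8 changes nothing but fails on invalid bytes.
if iconv -f utf-8 -t utf-8 "$work/dump.sql" > /dev/null; then
  valid=yes
else
  valid=no
fi

# 2. Same replacement as in the steps above, written to a second file.
sed -e 's@CHARSET=latin1@CHARSET=utf8mb4@g' "$work/dump.sql" > "$work/dump_utf8.sql"

# 3. Count the lines that still declare latin1 and those that now declare
#    utf8mb4. grep -c prints 0 and exits non-zero when nothing matches.
remaining=$(grep -c 'CHARSET=latin1' "$work/dump_utf8.sql" || true)
converted=$(grep -c 'CHARSET=utf8mb4' "$work/dump_utf8.sql" || true)

echo "valid UTF-8: $valid, latin1 left: $remaining, utf8mb4 tables: $converted"
rm -rf "$work"
```

On the real dump you would point these checks at earthli.sql and earthli_utf8.sql instead of the generated files.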
In PHP and the configuration, I made the following changes:
1. Call mysqli_set_charset ($this->_connection, 'utf8mb4'); after opening the
connection to the database
2. Change the encoding in all generated pages by including the tag <meta
charset="utf-8">
3. Change the default charset in the Apache config files: php_value
default_charset UTF-8 (it's possible that this is already the default by
now)
[Conclusion]
It took a bunch of research and preparation and nerves to dump, globally modify,
and re-import a database that contains the last quarter-century of my writing.
In the end, though, it wasn't even that much work and it went smoothly. As
always with encodings, it serves you well to understand exactly what you're
doing -- it often saves a lot of steps.
And, now, because I can: ✊🏼.
--------------------------------------------------------------------------------
[1] The zip functions will be removed in favor of the object-based API.
[1] 2.6k of which are published ... I write a lot of drafts that I never end up
publishing. Some of them are quite long as well and serve as notes for
myself.
[1] It's probably a legacy thing or a desire to provide an option that uses one
less byte if you know you're not going to want emojis?
]]>