Favourite Projects



A little awk script to change the encoding in parts of a file

March 24th, 2011 by lukav

We’re in the process of migrating from our ancient CVS to the more modern Git. However, I stumbled on the following problem. We write our commit comments in Bulgarian, in windows-1251 encoding. Git uses UTF-8, although I’m not sure whether it does this natively or whether the client determines the comments’ encoding. So I had to change all the commit comments from cp1251 to UTF-8. I couldn’t just convert the whole file, because some of the files had already changed their encoding along the way, and I wanted to keep the history and the current encoding intact.

One way was to use the “cvs admin -m rev:comment” command, which changes the comment for a given revision in CVS. But that would mean writing a script that goes over each file, gets the whole log, tries to figure out each revision and its comment, and then calls the admin command. Furthermore, it had to work with multi-line comments. Although it is possible, it seemed like too much trouble to me, with many places where the comments could break.

So I looked at the idea of modifying the RCS files directly. I needed a tool to find the parts of the RCS file (that is, the ,v file) between the lines containing only “log” and “text” and change the encoding only for those parts. It doesn’t seem complicated, but when I tried my favorite ‘sed’, it couldn’t call the external ‘iconv’ for just parts of the file. So I needed an alternative.
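For illustration, sed's range addresses can at least show which lines are affected, even though sed cannot hand them to iconv. The sketch below assumes a file laid out as described above; show_log_sections is a hypothetical helper name, not something from the original workflow.

```shell
# Print only the lines from a "log" line through the next "text" line.
# These are the spans whose encoding needs to change.
show_log_sections() {
    sed -n '/^log$/,/^text$/p' "$1"
}
```

Running it on a single ,v file is a quick way to check what would be converted before touching anything.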

After googling around, it turned out awk was the tool for the job. It has a system() function that can execute an external program for a given line.

So here it is: an awk script that looks for /^log$/ and then runs iconv on each line until it finds /^text$/.

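#!/usr/bin/awk -f

# A line containing only "log" starts a section to convert.
/^log$/ {
    flag = 1
}

# Inside a section: pass the line through iconv.
flag == 1 {
    line = $0
    # Quote the line for the shell: it is wrapped in single quotes below,
    # so embedded single quotes are escaped and $, ` and " stay literal.
    gsub(/'/, "'\\''", line)
    system("printf '%s\\n' '" line "' | iconv -f cp1251 -t utf8")
}

# Outside a section: print the line unchanged.
flag != 1 { print }

# A line containing only "text" ends the section (it is still sent through iconv above).
/^text$/ {
    flag = 0
}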
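To run the script over a whole repository, something like the following might work. This is only a sketch: convert_tree is a hypothetical helper name, the script path and the repository root are passed in as arguments, and it assumes every RCS file name ends in ",v". Try it on a copy of the repository first.

```shell
# Hypothetical helper: apply an awk script to every RCS (,v) file under a directory.
# Usage: convert_tree /path/to/script.awk /path/to/repository-copy
convert_tree() {
    script=$1
    root=$2
    find "$root" -type f -name '*,v' | while IFS= read -r f; do
        # Write to a temporary file first so a failed run never truncates the original.
        # Note: awk's exit status reflects awk itself, not the iconv calls inside it.
        if awk -f "$script" "$f" > "$f.tmp"; then
            # mv replaces the file, so the original mode and owner are not preserved.
            mv -f "$f.tmp" "$f"
        else
            rm -f "$f.tmp"
            echo "conversion failed: $f" >&2
        fi
    done
}
```

RCS files are usually read-only, so the permissions may need to be restored afterwards.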

Of course the file can be easily modified for different tasks.
Enjoy it.
