Tuesday, May 25, 2004
Translation
This is a pretty simple job, and can be broken down into the following subtasks:
Get text for translation and encode it into an HTTP POST request
Send the data to the web server, acting in effect as a .NET web browser
Read the response back into a big string
Remove all the HTML and formatting and send the raw translated string back to the client.
[WebMethod]
public string BabelFish(string translationmode, string sourcedata)
{
}

readonly string[] VALIDTRANSLATIONMODES = new string[] {
    "en_zh", "en_fr", "en_de", "en_it", "en_ja", "en_ko", "en_pt", "en_es",
    "zh_en", "fr_en", "fr_de", "de_en", "de_fr", "it_en", "ja_en", "ko_en",
    "pt_en", "ru_en", "es_en"
};
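As a rough illustration of how a list like this might be used to validate a mode, here is a minimal sketch. The class name ModeValidator and the method IsValidMode are made up for this example and are not part of the downloadable code; the real service may do the check differently.

```csharp
using System;

public static class ModeValidator
{
    // Same language pairs as the VALIDTRANSLATIONMODES array in the article.
    static readonly string[] ValidModes = {
        "en_zh", "en_fr", "en_de", "en_it", "en_ja", "en_ko", "en_pt", "en_es",
        "zh_en", "fr_en", "fr_de", "de_en", "de_fr", "it_en", "ja_en", "ko_en",
        "pt_en", "ru_en", "es_en"
    };

    // Hypothetical helper: true only when the mode is one of the known pairs.
    // The comparison is exact and case-sensitive.
    public static bool IsValidMode(string mode)
    {
        return mode != null && Array.IndexOf(ValidModes, mode) >= 0;
    }
}
```

Rejecting unknown modes up front also means nothing unexpected ends up in the lp= field of the request.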
The code checks that the requested mode is in this list before passing it on to Babelfish. After that, we create a POST request. An HTTP POST request looks something like this:
POST /babelfish/tr/ HTTP/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 51

lp=en_fr&tt=urltext&intl=1&doit=done&urltext=cheese
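To make the relationship between the body and the Content-Length header concrete, here is a small sketch that builds the same form body and counts its bytes. The names FormBody, Build and ContentLength are invented for this example. It also uses System.Net.WebUtility.UrlEncode so that it runs without a reference to System.Web; that is a substitution made for the sketch, not what the service itself calls.

```csharp
using System;
using System.Net;
using System.Text;

public static class FormBody
{
    // Hypothetical helper: builds the form-encoded body shown in the request above.
    // Only the text to translate is URL-encoded; the mode is assumed to be validated already.
    public static string Build(string translationMode, string sourceData)
    {
        return "lp=" + translationMode
             + "&tt=urltext&intl=1&doit=done&urltext="
             + WebUtility.UrlEncode(sourceData);
    }

    // Content-Length is a count of bytes on the wire, not of characters.
    public static int ContentLength(string body)
    {
        return Encoding.UTF8.GetByteCount(body);
    }
}
```

For the mode en_fr and the text cheese, this gives the 51-byte body from the example request. Because URL-encoding leaves only ASCII characters, the character count and byte count happen to agree here, but counting bytes is the safer habit.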
It's pretty simple, and if you wanted to, you could write the data to the server over low-level sockets. The .NET Framework provides a better way, however: the HttpWebRequest class, which has lots of built-in features that make it easy to work with HTTP connections.
Uri uri = new Uri(BABELFISHURL);
HttpWebRequest request = (HttpWebRequest) WebRequest.Create(uri);
request.Referer = BABELFISHREFERER;

// URL-encode the source text and build the form body
string postsourcedata = "lp=" + translationmode + "&tt=urltext&intl=1&doit=done&urltext=" + HttpUtility.UrlEncode(sourcedata);

// Convert the body to bytes first, so that Content-Length is a byte count
UTF8Encoding encoding = new UTF8Encoding();
byte[] bytes = encoding.GetBytes(postsourcedata);

request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = bytes.Length;
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";

// Write the form body to the request stream
Stream writeStream = request.GetRequestStream();
writeStream.Write(bytes, 0, bytes.Length);
writeStream.Close();

// Send the request and read the whole response into one string
HttpWebResponse response = (HttpWebResponse) request.GetResponse();
Stream responseStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(responseStream, Encoding.UTF8);
string page = readStream.ReadToEnd();
readStream.Close();
response.Close();
We end up with a string containing the entire Babelfish page. As it stands, this is about 99% noise (HTML tags, AltaVista information, etc.) and 1% the translation we were looking for, so we need a regular expression to find the translated text. By looking at the HTML page, you will find the translation is contained between:
translation here
So the required regular expression looks like this (note: while testing my regular expressions, I got lots of help from Regulator):
((?:.|\n)*?)
This will match the whole ... string. This is a fairly complex regular expression, but basically, the . character matches any character except a newline, hence the (.|\n) pattern, which matches any character, newlines included.
The outer brackets create a capturing group, meaning that the text within them (namely the translation) will be put in its own group at index 1 (index 0 contains the whole match).
The ?: prefix suppresses capturing. Brackets normally create a capturing group, but here the inner pair is only needed to combine . and \n, which allows for line breaks in long translations.
Finally, *? is a lazy quantifier: it matches as few characters as possible, so the match stops at the first instance of the closing tag. (If I had used plain *, the expression would be greedy, and would chomp right up to the LAST closing tag.)
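The difference between the lazy and greedy forms is easy to see on a small input. The sketch below is only an illustration: the <b> and </b> delimiters are stand-ins chosen for the demo, not the markup Babelfish actually returns, and the class and member names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatchDemo
{
    // Hypothetical delimiters: <b>...</b> stands in for the real markup.
    public const string Lazy = @"<b>((?:.|\n)*?)</b>";   // stops at the first </b>
    public const string Greedy = @"<b>((?:.|\n)*)</b>";  // runs on to the last </b>

    // Returns the text captured by group 1, or null when nothing matches.
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

On the input `<b>one</b> and <b>two</b>`, the lazy pattern captures `one`, while the greedy pattern captures `one</b> and <b>two`. As an alternative to the (?:.|\n) construction, .NET's RegexOptions.Singleline option makes . match newlines as well.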
Here's the code:
Regex reg = new Regex(@"
(.*?)
");
MatchCollection matches = reg.Matches(page);
if (matches.Count != 1 || matches[0].Groups.Count != 2)
{
    return ERRORSTRINGSTART + "The HTML returned from Babelfish "
        + "appears to have changed. Please check for"
        + " an updated regular expression" + ERRORSTRINGEND;
}
return matches[0].Groups[1].Value;
And subject to error checking, that's it!
Using it
Download the code, and unzip it somewhere. Add a virtual directory called Translation in IIS. Go to /translate.asmx and click Test, and enter some test data (say 'en_fr', and 'cheese'). If it works, you are ready to use it in your web and Windows Forms applications.
To use it in your app, add a Web Reference to the asmx from the project you want to use it in; Visual Studio will create a proxy class for you, which you can then use to perform translation.
Here's some sample code-behind:
namespace test
{
    using System;
    using System.Data;
    using System.Drawing;
    using System.Web;
    using System.Web.UI.WebControls;
    using localhost1; // assuming that's the reference generated
    using System.Web.UI.HtmlControls;

    /// <summary>
    /// Summary description for WebUserControl1.
    /// </summary>
    public class WebUserControl1 : System.Web.UI.UserControl
    {
        protected System.Web.UI.WebControls.DropDownList ddTranslationMode;
        protected System.Web.UI.WebControls.TextBox txtText;
        protected System.Web.UI.WebControls.Label lblTranslation;
        protected System.Web.UI.WebControls.Button submitButton;

        private void Page_Load(object sender, System.EventArgs e)
        {
            // Put user code to initialize the page here
        }

        protected void submitButton_Click(object sender, System.EventArgs e)
        {
            string translationMode = this.ddTranslationMode.SelectedItem.Value;
            string translationText = this.txtText.Text.Trim();
            string translation = "";
            try
            {
                Translate tr = new Translate();
                translation = tr.BabelFish(translationMode, translationText);
            }
            catch (Exception exp)
            {
                translation = "There was an error accessing the server: " + exp.Message;
            }
            this.lblTranslation.Text = translation;
        }
    }
}