<HTML>
<HEAD>
<META NAME="description" CONTENT="Guesses the Chinese encoding of a given file or web page.">
<META NAME="keywords" CONTENT="chinese encoding unicode utf8 gb big5 hz ascii">
<TITLE>Chinese Encoding Auto-Detection</TITLE>
</HEAD>

<BODY BGCOLOR="lightyellow">
<CENTER><H1>Chinese Encoding Guesser</H1></CENTER>

<CENTER>
<H3>Guess for File on Your Computer</H3>
<FORM ENCTYPE="multipart/form-data" METHOD="post" ACTION="http://cgibin.erols.com/eepeter/cgi-bin/codeguess.pl">
File Name: <INPUT TYPE="file" SIZE="40" NAME="sourcename"><P>
<INPUT TYPE=HIDDEN NAME="inputtype" VALUE="file">
<INPUT TYPE=SUBMIT NAME="convtype" VALUE="Guess for this File!">
<INPUT TYPE=RESET VALUE="Reset Form">
</FORM>
</center>
<P>
<BR>
<P>
<!--

<H3>Guess for a Web Document</H3>
<FORM METHOD="post" ACTION="http://cgibin.erols.com/eepeter/cgi-bin/codeguess.pl">
Web Address:  <INPUT TYPE="text" NAME="sourcename" SIZE=50 VALUE="http://"><BR>
<INPUT TYPE="hidden" NAME="inputtype" VALUE="web">
<INPUT TYPE=SUBMIT NAME="convtype" VALUE="Guess for this Web Page!">
<INPUT TYPE=RESET VALUE="Reset Form">
</FORM>
<P>
-->
<STRONG>Supported Encodings</STRONG>:
GB2312, Hz, Big5, UTF8, ASCII, ("OTHER" for unrecognized encodings)
<P>
<I>Note</I>: To save on server time and space, only the first 100 lines from
the file or web document will be used in the guess.
<P>
  A new <A HREF="sinodetect.html">Java version of this tool</A> is now available.
<HR>
<P>
	One of the problems in Chinese computing is the variety of
internal encodings that can be used to represent Chinese characters.  The
most common of these are Big5 (used in Taiwan, Hong Kong), GB2312-1980
(the National Standard of the People's Republic of China), and Hz (an
e-mail-safe variant of GB).  Another scheme, which I personally hope gains
in popularity, is Unicode, which encodes about 21,000 simplified and
traditional characters.  UTF-8 is a variable-length encoding of Unicode that
is useful on existing systems that do not yet support the UCS-2 form of
Unicode.  For more information about Chinese (and Japanese and Korean)
encoding systems, I recommend <A HREF="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf" TARGET="_TOP">CJK.INF</A> by Ken Lunde of Adobe Systems.
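<P>
  To make the variety concrete, here is a small Python sketch (not part of this
tool, which is written in Perl) that encodes one character, U+4E2D, in each of
the schemes mentioned above.  It relies only on Python's built-in codecs;
"utf-16-be" stands in here for the UCS-2 form, which is an assumption that
holds for characters in the Basic Multilingual Plane.
<PRE>
```python
# Encode the same character under each scheme and show the raw bytes.
ch = "中"  # U+4E2D

# Codec names are Python's; "utf-16-be" is used as a stand-in for UCS-2.
names = ("gb2312", "hz", "big5", "utf-8", "utf-16-be")
encoded = {name: ch.encode(name) for name in names}

for name, raw in encoded.items():
    # Print the byte count and the bytes in hexadecimal.
    print(f"{name:10} {len(raw)} bytes  {raw.hex(' ')}")
```
</PRE>
The same character comes out as a different byte sequence each time, which is
why a file's encoding has to be guessed from its bytes.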
<P>
  This web application uses several heuristics to determine which encoding
system is most likely for a given document.  It does this in two stages.
First, it checks whether the characters in the document fit the code ranges
for each encoding system.  Then it checks the characters in the document
against frequency tables for each encoding.  The encoding system that scores
highest is the guess shown in the results.  If no encoding appears likely,
the application returns "OTHER".
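<P>
  The two stages can be sketched roughly as follows.  This is a minimal Python
illustration, not the Perl code behind this page: the byte ranges, the tiny
frequency table (COMMON), the scoring rule, and the names split_pairs and
guess_encoding are all assumptions made for the example.
<PRE>
```python
# Assumed code ranges: (lead-byte range, list of trail-byte ranges).
RANGES = {
    "GB2312": (range(0xA1, 0xF8), [range(0xA1, 0xFF)]),
    "Big5": (range(0xA1, 0xFA), [range(0x40, 0x7F), range(0xA1, 0xFF)]),
}

# Hypothetical, very small frequency table; a real one would be far larger.
COMMON = "的一是不了人我在有他中大上"
FREQ = {
    "GB2312": {ch.encode("gb2312") for ch in COMMON},
    "Big5": {ch.encode("big5") for ch in COMMON},
}


def split_pairs(data, lead, trails):
    """Stage 1: return the two-byte pairs if all bytes fit the ranges, else None."""
    pairs, i = [], 0
    while len(data) > i:
        b = data[i]
        if b in range(0x80):  # plain ASCII byte, skip it
            i += 1
            continue
        if b not in lead or i + 1 >= len(data):
            return None
        if not any(data[i + 1] in r for r in trails):
            return None
        pairs.append(data[i:i + 2])
        i += 2
    return pairs


def guess_encoding(data):
    """Return the most likely encoding name for a bytes object."""
    if data.isascii():
        return "Hz" if b"~{" in data else "ASCII"
    try:
        data.decode("utf-8")  # strict decoding doubles as a range check
        return "UTF8"
    except UnicodeDecodeError:
        pass
    best, best_score = "OTHER", 0.0
    for name, (lead, trails) in RANGES.items():
        pairs = split_pairs(data, lead, trails)
        if not pairs:
            continue
        # Stage 2: fraction of pairs that appear in the frequency table.
        score = sum(p in FREQ[name] for p in pairs) / len(pairs)
        if score > best_score:
            best, best_score = name, score
    return best
```
</PRE>
In this sketch, text that passes the range check for both GB2312 and Big5 is
separated only by the frequency score, which is why the second stage matters.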
<P>
  I've put the Perl code used by this application into a separate file
(someday to become a Perl module, when I learn how to do that).
You can <A HREF="download/codelib.zip">download this code</A> (Perl5) and use it
in your own programs.  I'd like to add other encoding schemes (JIS, CNS,
KSC) as I learn more about them.
<P>
  I'm interested in hearing your ideas and suggestions for this tool.  Please
visit my <A HREF="contact.html">contact page</A> to submit
your comments.  If you came to this page directly, you might also want to take
a look at some of my other <A HREF="http://www.mandarintools.com" target="_top">on-line Chinese 
tools</A>.
	
</BODY>
</HTML>