Here is a utility written in C++ http://www.pastebin.ca/965243 that does this: counts up the frequency of each word in a text file; the output looks something like this
Lines Words Bytes File
3853 29076 163218 alice30.txt
Time: 46 ms
and time is printed at the end.
For fun, I wrote a java version...
http://pastebin.com/f58b08979 (java version).
The text file to be used is 3.9 MB bible.txt
That would be test one: bible.txt. Make the file larger into 10x bibles, making it a 40 MB file, that would be test two: bible2.txt.
The client VM is faster than -server VM for files under 10 MB and -server VM is faster for larger files. In other words, use client VM for smaller files and -server for larger file.
C++ version was compiled with VC++: cl /O2 /GL wc1.cpp /link /ltcg
For 3.9 MB bible.txt
java Jwc bible.txt>log.txt
Time: 546 ms (c++)
Time: 359 ms (java client)
For 40 MB bible2.txt
java -server Jwc bible2.txt>log.txt
Time: 5468 ms (c++)
Time: 2359 ms (java -server)
How come C++ version is much slower for 40 MB file? The difference increases with 80 MB file.
I want to see a C++ version that is faster than java version. Post it to http://pastebin.com/and post the link in comments, along with the time compared to java version.
But here is the criteria.
(1) C++ version must use standard library
(2) C++ version must compile on both GCC and VC++ (or at least with one of them).
(3) To reduce IO factor, time counting won't include the time to print output. Also, time counting should start from main() so not to include VM load time.
The fastest version I have seen so far is U++http://www.ultimatepp.org/www$uppweb$vsd$en-us.html (only 850 ms for 40 MB file!). However, it uses non std strings, map, and I/O. (Update: this changed now since the new Java version is faster)
UPDATE: C++ version was fixed. The problem was mostly with using map which is much slower than Hashmap.
The fixed C++ version. http://pastebin.com/f64eaec59
Time: ~2200 ms (C++ with 40 MB file)
Update: The even faster C++ version that uses trie structure instead of map http://pastebin.com/f222d2344
Time: 1078 ms
Update: The newer java version by pm_kirkham is even faster. Find it here http://pastebin.com/f5ca11b88
40 MB file
time 657 ms (java client)
time 595 ms (java server)
time 575 ms (Java Jet compiler)
Use command line: jc -inline+ Jwc.class to compile with Jet
Update: The C++ version that uses the same data structure for trie as the above Java version:
time 560 ms (g++)