Here is a utility written in C++ (http://www.pastebin.ca/965243) that counts the frequency of each word in a text file. The output looks something like this:
Lines Words Bytes File
3853 29076 163218 alice30.txt
---------------------------------------
19 A
1 ABOUT
1 ADVENTURES
3 ALICE
4 ALL
3 AND
2 ANY
(snip)
Time: 46 ms
The elapsed time is printed at the end.
For fun, I wrote a Java version: http://pastebin.com/f58b08979
The text file to be used is the 3.9 MB bible.txt
http://www.cas.mcmaster.ca/~bill/strings/english/bible
That is test one: bible.txt. For test two, concatenate ten copies of the bible into a 40 MB file: bible2.txt.
The client VM is faster than the -server VM for files under 10 MB, and the -server VM is faster for larger files. In other words, use the client VM for smaller files and -server for larger ones.
The C++ version was compiled with VC++: cl /O2 /GL wc1.cpp /link /ltcg
For 3.9 MB bible.txt
wc1 bible.txt>log.txt
java Jwc bible.txt>log.txt
Time: 546 ms (c++)
Time: 359 ms (java client)
For 40 MB bible2.txt
wc1 bible2.txt>log.txt
java -server Jwc bible2.txt>log.txt
Time: 5468 ms (c++)
Time: 2359 ms (java -server)
Why is the C++ version so much slower on the 40 MB file? The gap widens further with an 80 MB file.
I want to see a C++ version that is faster than the Java version. Post it to http://pastebin.com/ and put the link in the comments, along with the time compared to the Java version.
But here are the criteria:
(1) The C++ version must use the standard library.
(2) The C++ version must compile on both GCC and VC++ (or at least on one of them).
(3) To reduce the I/O factor, the timed section must not include printing the output. Timing should also start from main(), so as not to include VM load time.
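Criterion (3) could be satisfied with a helper like the one below (a sketch using C++11's &lt;chrono&gt;; older compilers would use clock() instead):

```cpp
#include <chrono>

// Times a callable and returns elapsed milliseconds. Start the clock inside
// main() (after process/VM startup) and stop it before printing the report,
// so neither startup cost nor output I/O is counted.
template <typename F>
long long time_ms(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();  // the counting phase only; printing happens after timing stops
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

steady_clock is used rather than system_clock so that wall-clock adjustments cannot skew the measurement.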
The fastest version I have seen so far is U++ (http://www.ultimatepp.org/www$uppweb$vsd$en-us.html), at only 850 ms for the 40 MB file! However, it uses non-standard strings, map, and I/O. (Update: this has changed now that the new Java version is faster.)
UPDATE: The C++ version was fixed. The problem was mostly the use of std::map, which is much slower than Java's HashMap.
The fixed C++ version: http://pastebin.com/f64eaec59
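The difference comes down to container choice: std::map is a balanced tree, so every lookup costs O(log n) string comparisons, while Java's HashMap hashes the key once. A minimal sketch of the hash-based fix, using C++11's std::unordered_map (at the time, the alternative was the non-standard hash_map):

```cpp
#include <string>
#include <unordered_map>

// Hash table: amortized O(1) per lookup, comparable to Java's HashMap.
// std::map would instead walk a red-black tree, comparing strings at each node.
std::unordered_map<std::string, long> freq;

void count_word(const std::string& word) {
    ++freq[word];  // one hash, one bucket probe; value-initialized to 0 on first insert
}
```

The trade-off is that iteration no longer yields words in sorted order, so the report phase needs an extra sort if alphabetical output is required.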
Time: ~2200 ms (C++ with 40 MB file)
Update: An even faster C++ version that uses a trie structure instead of a map: http://pastebin.com/f222d2344
Time: 1078 ms
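A trie replaces hashing and string comparisons with one array index per character. A minimal sketch of the idea (illustrative, not the pastebin code), assuming words are already uppercased A-Z:

```cpp
#include <cstring>
#include <string>

// One child slot per letter; a word's count lives at its final node.
struct TrieNode {
    TrieNode* child[26];
    long count;
    TrieNode() : count(0) { std::memset(child, 0, sizeof child); }
};

// Walk (and lazily build) the path for `word`, then bump the count at its end.
void add(TrieNode* root, const std::string& word) {
    TrieNode* n = root;
    for (char c : word) {
        int i = c - 'A';  // assumes uppercase A-Z input
        if (!n->child[i]) n->child[i] = new TrieNode();
        n = n->child[i];
    }
    ++n->count;
}
```

Each character costs a single array dereference, and repeated prefixes ("AND", "ANY") share nodes, which is why this beats both the tree map and the hash map here.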
Update: The newer Java version by pm_kirkham is even faster. Find it here: http://pastebin.com/f5ca11b88
40 MB file
time 657 ms (java client)
time 595 ms (java server)
time 575 ms (Java Jet compiler)
Use the command line jc -inline+ Jwc.class to compile with Jet.
Update: A C++ version that uses the same trie data structure as the Java version above: http://pastebin.com/f3a21c38f
time 560 ms (g++)
2 comments:
Hey, I have created two programs for your request: one that uses std::string and one that uses char *.
I think both are faster than your C++ version, but I cannot compare them to Java because I don't have a Java compiler, nor the knowledge to do so.
test2 is much cleaner than the Java version but a bit slower than test3.
The only difference between test2 and test3 is that in test3 I read the whole file at once and use char * instead of std::string.
It's well known that std::string is a bit slow.
Here are the links and the times I get for bible.txt.
test -> 680ms -> your version
test2 -> 660ms -> my std::string version
test3 -> 330ms -> my char * version.
test2 -> http://pastebin.com/f6a851e86
test3 -> http://pastebin.com/f68a0b61c
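For illustration, the read-everything-then-scan-with-pointers approach asterisc describes might look like this (a sketch under those assumptions, not the pastebin code):

```cpp
#include <cctype>
#include <cstdio>
#include <vector>

// Read the whole file into one buffer in a single fread call:
// no per-line getline, no per-word std::string allocations.
std::vector<char> read_all(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<char> buf(size + 1);
    std::fread(buf.data(), 1, size, f);
    std::fclose(f);
    buf[size] = '\0';  // terminator so the scanner can stop on '\0'
    return buf;
}

// Scan the buffer with a raw pointer, counting runs of letters as words.
long count_word_tokens(const char* p) {
    long words = 0;
    while (*p) {
        while (*p && !std::isalpha((unsigned char)*p)) ++p;  // skip separators
        if (*p) {
            ++words;
            while (std::isalpha((unsigned char)*p)) ++p;     // skip over the word
        }
    }
    return words;
}
```

The speedup comes from doing one large read instead of many small ones, and from never copying characters out of the buffer while tokenizing.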
However, your C++ version doesn't print bytes_count correctly; it differs from the file size.
Regards,
asterisc