[announce] (nearly) full jgit revision machinary
Shawn O. Pearce
spearce at spearce.org
Wed Aug 31 22:00:12 BST 2011
I have a nearly complete revision machinary implementation for jgit.
The series follows on top of my pack index v2 work, which is in my
master branch of repo.or.cz/egit/spearce.git.
This series started because I found the History view was just too
slow for words. Running JProbe on Eclipse pointed out that our
existing revision walking code was not up to the challenge of a
real-world sized repository.
Here goes a summary... series is here:
git://repo.or.cz/egit/spearce.git revwalk
Fixes and cleanups
------------------
* Correct unsigned integer conversion issues
* Include refs/remotes as part of the ref search path
* Only use bytes 1-5 for a SHA-1 ObjectId hashCode
* Fix the OBJECT_ID_LENGTH constant at 20
* Hand unroll the ObjectId.equals method
* Don't bother caching commits in the delta base cache
* Teach FileMode how to return the proper instance of itself
* Teach FileMode about S_IFGITLINK being 0160000
* Teach FileMode about MISSING being 0
These have nothing to do with the series directly, but were
done to improve its code or fix some earlier bugs.
Revision Machinary
------------------
* New revision walking library for jgit
* Teach RevWalk about uninteresting commits
* Document the flag constants used in RevWalk
* Implement a commit filtering API for RevWalk
* Make all RevFilter implementations cloneable
* Allow RevFilters to break out of the main walk loop
* Implement a commit time based RevFilter for before/after filtering
* Refactor basic commit walking from RevWalk to a streamable iterator API
* Evaluate RevFilters before inserting parent commits to pending
* Dispose of a RevCommit as soon as it is no longer interesting
* Use a FIFO queue of RevCommit during reset rather than sorting by date
Aside from --topo-order this is a nearly full revision machinary.
Features like --author, --committer, --grep, --not, all work.
We don't parse "A..B" but we can do it as "A --not B", or
"A ^B ^C D".
The package's performance on Java 6/x86/Windows is well within
spitting distance of Git on Cygwin. I'm seeing jgit take just
over 50 ms longer to produce back ~13,000 commits.
I'll get --topo-order done soon, most likely this weekend. It
will probably also get an incremental restart trick like we do
in C Git, where we shovel the data out as early as we can but
signal a restart if topo order was violated.
Memory can still be reduced a tad, but its fast as heck.
Our main suckage for memory is the ObjectId class. On a modern
HotSpot JVM it costs 44 bytes to store a 20 byte SHA-1. I think
our second main suckage is the ObjectIdMap. java.util.HashMap is
not that efficient when it comes to memory usage.
Tree Diff
---------
* New tree walking library for jgit
* Implement a tree entry filtering API for TreeWalk
* Define a path limiter for TreeWalk to reduce results
Turns out the Tree class is too slow for serious work. This is
a concurrent tree walker that can do an N-way difference of
an arbitrary number of trees, recursively and shallow. We can
also limit the difference to only entries matching a particular
filter expression.
I spent a good amount of time tuning this library, but it
probably can still benefit from more micro-optimization. Java is
just dog slow, but I think we're doing almost the best we can.
At present this library only does tree objects, but it should
also be able to walk the local filesystem and an index (or
two or three) concurrently with trees. I'll implement those
other iterators in the near future.
Path Limiter & Parent Rewrites
------------------------------
* Implement path limited revision walking and parent rewriting
* Allow revwalk and treewalk to directly access cached object data
These changes give us functionality like "git log HEAD -- foo.c",
where the DAG gets subsetted down to only those commits that
had an interesting change against foo.c. Yes, it also works
on directories and combinations of paths.
I've done a bunch of manual testing and differencing against
the C Git output for the same inputs, and jgit is producing
identical results. That's with "--parents". So we can now
get a dense, plottable subsetted DAG. Yay.
When the path limiter is enabled we spend ~98% of our CPU time
inside of the treewalk package introduced above for the tree
difference feature. Its horribly slow. If the path we are
looking for is early enough in the tree we're not too far off
from C Git's performance, but try matching on say "t" and the
world falls over. We are 4x slower.
Fixes and cleanups
------------------
* Implement some basic command line programs built on top of jgit
This a "jgit" command line wrapper, along with basic subcommand
tools like "log", "rev-list", "ls-tree" and "diff-tree", all
built on top of the above functionality. jgit is starting to
feel like it is actually capable of doing some of the major
functions C Git does.
"jgit log | less" has a slight lag before the first commit when
compared to "git log", but is still faster than "svn log". ;-
--
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
More information about the git-announce
mailing list