Hang in Fetch after Clone while converting to mirror #356
Labels
No Label
bug
duplicate
enhancement
invalid
question
up for grabs
wontfix
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: jcarr/git2go#356
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
The following program hangs most times I run it (but not always):
It hangs during the call to remote.Fetch. Sample run, killed manually by SIGQUIT when it hung:
Reproduced with both Go 1.7.4 and Go 1.8 beta 1.
Doing some horrible manual symbol resolution, I managed to tease out this backtrace:
Frames 14 to 22 are all in Go world and are uninteresting. I believe that the pcs in the 0x00007fff... range correspond to assertion failures.
libgit2 itself does not appear to be the problem. The following equivalent C program always works successfully.
This smells like memory corruption due to a race somewhere, but I don't see where/how. Running under the Go race detector yields no complaints. I'm running darwin, not linux, so I can't test with
-msan
.Ah. If I let it run long enough, I eventually get
curl error: Failed to get recent socket
. Looking at dtruss output, I think the problem might indeed be that the clone happens on one system thread and the fetch happens on a different one.I see a slew of runtime.LockOSThread calls throughout libgit2, but each only lasts as long as an individual git2go function/method call. That means that the Go runtime is free to switch from thread to thread between calls, which is indeed very likely to happen for cgo calls. Looking at the dtruss output (https://gist.github.com/josharian/c8b990eec873f245e30b6f29bfcc9245), it appears that in fact different threads are in use.
I wonder whether a different thread-locking strategy is required. An example alternative is for a Repository to spin up a goroutine that does nothing but read from a
chan func()
and execute the functions that it gets passed; that goroutine could be locked to an OS thread. That would mean thatRepository.Free
is no longer optional.Or maybe I have mis-diagnosed. In any case, this is a blocker for me to be able to use git2go. Please let me know what I can do to help.
More data.
I've instrumented libgit2 (now using next / tip) and spied using Wireshark.
The clone succeeds. The fetch begins, and transfers a bunch of data. Then, for no reason I can see, the TCP traffic ceases and QUIC traffic begins. At this point, libgit2 is blocked in a select statement in wait_for, called from curls_read (all in curl_stream.c). Then, about 4 minutes later, the server sends a TCP FIN ACK. The select completes and curls_read returns an error.
None of this explains why the C code succeeds, but this no longer really looks to me like a threading bug.
I've been unable to reproduce this locally, but FWIW you don't need the second fetch, we have a callback to override the remote creation precisely for this use-case. The library will call your code instead of its own if you have this override and then you can do whatever during the initial remote creation, like git's own
git clone --mirror
does.I've reworked the code from the issue description to a test that does it in a single step to test things out. While it doesn't fix the issue per se, it would avoid triggering it and it avoids the no-transfer network connection.
As it happens, running a test against go tip https://travis-ci.org/libgit2/git2go/jobs/186415531 (which will become 1.8) errors out in fetch, so it does seem likely we're messing up somehow, though I still don't know exactly where.
Coming back to this after a while... I can sometimes reproduce this hang with just the tests we have in the codebase. Sometimes it outright crashes. Via printf-debugging I see in the crashes that the finalizer is running for the
Remote
object, even though we're in the middle of one of its functions.It looks like the much more aggressive garbage collector in 1.8 does not consider the receiver to be keeping it alive. It does look exactly like the docs for
runtime.KeepAlive
show.I'm on vacation for the next few weeks, but that seems plausible.
I'm going to consider this closed as of #393, do reopen if it's still broken.