Opened 12 years ago
Closed 10 years ago
#19 closed defect (fixed)
git: Apply patches by bauxite
Reported by: | dmik | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | *none | Version: | |
Severity: | | Keywords: | |
Cc: | dryeo, steve53@… | | |
Description
komh maintains his own build of git here: http://bauxite.sakura.ne.jp/software/os2/. These patches contain some important fixes: in particular, they fix cloning of large repositories (where our RPM git fails) and "out of memory" errors when doing gc/repack.
Note that the komh patches already contain the sava patches that our SVN also contains, but komh uses the git 1.7.3.2 source base.
Attachments (1)
Change History (76)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
The good thing is that komh patches fix the problem with starting scripts from libexec/git-core which exists on the trunk.
comment:4 by , 12 years ago
Note that the solutions like this one http://stackoverflow.com/questions/6842687/the-remote-end-hung-up-unexpectedly-while-git-cloning don't actually help. I don't know if it is related at all. This needs deeper debugging.
comment:5 by , 11 years ago
Summary: | git: Apply patches by komh → Prioritize SHELL over EMXSHELL |
---|---|
Type: | task → defect |
I'm now trying to finish my git work in the meantime (while there is a delay with making WLINK work with the recent Firefox).
comment:6 by , 11 years ago
Summary: | Prioritize SHELL over EMXSHELL → git: Apply patches by komh на Prioritize SHELL over EMXSHELL |
---|---|
Type: | defect → task |
comment:7 by , 11 years ago
Summary: | git: Apply patches by komh на Prioritize SHELL over EMXSHELL → git: Apply patches by komh |
---|---|
Type: | task → defect |
AAAAAAAAAAAAAA. Safari on Mac has gone completely out of its mind. It substitutes the values from the last ticket I posted to (http://trac.netlabs.org/libc/ticket/287#comment:5).
comment:8 by , 11 years ago
Type: | defect → task |
---|
comment:9 by , 11 years ago
Type: | task → defect |
---|
I realize that http-push is absent from both my builds and the RPM build by Yuri (but present in the original KOMH builds). This seems to depend on CURL 7.9.8 or above (we have 7.21.1) and EXPAT. But we don't have libexpat (at least not in RPM), so git-http-push.exe is not built. This explains why our "stock" RPM build couldn't push over HTTP. I don't get how my local build can do that though, since I don't have git-http-push.exe either... Maybe I hacked it somehow, I don't actually remember. This has always been an unfinished hack...
Anyway, this doesn't explain why push over HTTP works but also fails with various RPC errors and unexpected messages in the original KOMH build. But at least it may point to where to look.
And the HTTP push problem is the one I suffer from the most now. Only 2 out of 10 of my commits go through (with either of the builds I have). I often have to use my Mac machine in order to commit Mozilla changes. This somehow seems to depend on the size of the commit object. Very small commits usually work very well. But if a commit is bigger than some size, then no way. I can't recall the exact error message, I will paste it next time I see it.
The HTTP clone problem may be somehow related to the HTTP push problem. But the fact is that the original build from KOMH (git 1.7.3.2) is free from that problem... As I already guessed, this may be a regression of some sort. Maybe updating to 1.8/1.9/2.0 will help, we will see. I need to collect more details on all that.
This is all a bit messed up in my head ATM. Too many things are on the plate. Will continue sorting them out.
comment:10 by , 11 years ago
Type: | defect → task |
---|
NO IT'S NOT A DEFECT, IT'S A TASK! SAFARI, WHOEVER, STOP!!
comment:11 by , 11 years ago
Type: | task → defect |
---|
Btw, this is what I usually get now here at netlabs:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/trac/web/api.py", line 514, in send_error
    data, 'text/html')
  File "/usr/local/lib/python2.7/site-packages/trac/web/chrome.py", line 968, in render_template
    message = Markup(req.session.pop('chrome.%s.%d'
  File "/usr/local/lib/python2.7/site-packages/trac/web/api.py", line 316, in __getattr__
    value = self.callbacks[name](self)
  File "/usr/local/lib/python2.7/site-packages/trac/web/main.py", line 268, in _get_session
    return Session(self.env, req)
  File "/usr/local/lib/python2.7/site-packages/trac/web/session.py", line 200, in __init__
    if req.authname == 'anonymous':
  File "/usr/local/lib/python2.7/site-packages/trac/web/api.py", line 316, in __getattr__
    value = self.callbacks[name](self)
  File "/usr/local/lib/python2.7/site-packages/trac/web/main.py", line 135, in authenticate
    authname = authenticator.authenticate(req)
  File "/usr/local/lib/python2.7/site-packages/trac/web/auth.py", line 91, in authenticate
    req.incookie['trac_auth'])
  File "/usr/local/lib/python2.7/site-packages/trac/web/auth.py", line 238, in _get_name_for_cookie
    name = self._cookie_to_name(req, cookie)
  File "/usr/local/lib/python2.7/site-packages/trac/web/auth.py", line 234, in _cookie_to_name
    for name, in self.env.db_query(sql, args):
  File "/usr/local/lib/python2.7/site-packages/trac/db/api.py", line 122, in execute
    return db.execute(query, params)
  File "/usr/local/lib/python2.7/site-packages/trac/db/util.py", line 121, in execute
    cursor.execute(query, params)
  File "/usr/local/lib/python2.7/site-packages/trac/db/util.py", line 65, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/local/lib/python2.7/site-packages/trac/db/sqlite_backend.py", line 78, in execute
    result = PyFormatCursor.execute(self, *args)
  File "/usr/local/lib/python2.7/site-packages/trac/db/sqlite_backend.py", line 56, in execute
    args or [])
  File "/usr/local/lib/python2.7/site-packages/trac/db/sqlite_backend.py", line 48, in _rollback_on_error
    return function(self, *args, **kwargs)
OperationalError: database is locked
comment:12 by , 11 years ago
Back to git, the sad thing is that it doesn't allow building out of the source tree. There is configure that is supposed to support this but apparently there are mistakes in Makefiles (places that don't take this scenario into account).
comment:13 by , 11 years ago
Type: | defect → task |
---|
comment:14 by , 11 years ago
This is completely disgusting and mind-blowing. I don't understand how such a nice and (relatively) small tool can have such creepy housekeeping. Absurd. I will have to live with the pain as I don't want to clean out this dirt ATM (no time, and I'm really tired of being the dirt cleaner in every house).
comment:15 by , 11 years ago
Type: | task → defect |
---|
Meanwhile, I found that the latest expat version (2.1.0) already includes some OS/2 bits: http://expat.cvs.sourceforge.net/viewvc/expat/expat/watcom/?hideattic=0&pathrev=R_2_1_0. It's intended for Watcom but I think it shouldn't be difficult to build it with GCC: DOS-related changes are marked with WATCOM (and perhaps the ones marked with WIN32 deserve attention too).
The ports ticket: #23.
comment:16 by , 11 years ago
I finally built git 1.7.9.6 from this SVN's trunk (i.e. no KOMH patches) using the RPM env and our own expat. The two major problems are still here:
- It can't clone big repos like https://github.com/psmedley/gcc. Fails with:
Cloning into '.'...
error: RPC failed; result=52, HTTP code = 0
fatal: The remote end hung up unexpectedly
- It can't push big commits (any repo). Fails with:
error: RPC failed; result=52, HTTP code = 0
fatal: The remote end hung up unexpectedly
fatal: The remote end hung up unexpectedly
With the build from the KOMH branch, problem 1 disappears, problem 2 is still there.
I will do the following now:
- Update trunk to a more recent version, 1.8/1.9/2.0, to see if the problems still persist.
- If yes, then sort out KOMH patches to find which change fixes problem 1.
- Try to fix problem 2 as well.
- Check other KOMH patches and apply those we need.
comment:17 by , 11 years ago
I've imported top releases of each version (1.8.5.5, 1.9.4 and 2.0.0). But of course there is a problem with merging due to unusual file names test cases use (and the weakness of our SVN client in handling this). That's what I get:
svn: Unable to parse URL '/repos/ports/git/vendor/1.7.9.6/t/t4013/diff.format-patch_--inline_--stdout_initial..master^^'
svn: Error reading spooled REPORT request response
The error is completely meaningless. The name itself looks pretty legal, and other SVN commands (checkout, update) handle it well. So I really wonder what it could be (I'm using SVN 1.6.16 from Paul AFAIR). But anyway, I'm not going to fix SVN right now as there's already too much in my pipeline.
I will create a temp branch (will name it dmik) and experiment on that branch to see which update we can use on OS/2 with less effort. A temporary branch is necessary to merge trunk and vendor/2.0.0 on Mac (where I have subversion 1.7 and all works).
comment:18 by , 11 years ago
That's exactly the same error I had with Ghostscript. So I decided to update to 1.7 and the error went away.
See http://os2ports.smedley.id.au/index.php?page=subversion for download links.
comment:19 by , 11 years ago
I've just tried SVN 1.7 from Paul and it works well, no errors like that. Great. We will do an RPM for it too.
comment:20 by , 11 years ago
I've updated git to 2.0.0 and unfortunately this didn't fix any of the RPC failures. The errors are completely the same. This means that the problem is somehow OS/2 specific. I will have to step debug it and study KOMH patches once again.
comment:21 by , 11 years ago
Tried to grab the git_os2_read()/git_os2_write() and _select2() changes from KOMH patches, but none seems to help with the HTTP problems. There is also a pipe/poll reimplementation which uses native OS/2 pipes. I will look closer at whether it may be related. The other code doesn't seem to be related to the HTTP transport at all, so that must be it...
comment:22 by , 11 years ago
I've applied the pipe/poll code from KOMH and it's still not working. Need to step debug the relevant code and perhaps expat.
comment:23 by , 11 years ago
Using GIT_CURL_VERBOSE=1 is very handy to debug HTTP(S) sessions. From what I see, there are the following differences between the original 1.7.3.2 version from KOMH (works), 1.7.9.6 trunk (doesn't) and 2.0.0 dmik branch (doesn't):
- Both 1.7.9.6 and 2.0.0 spit messages like "0x20059d60 is at send pipe head!" while connecting to github and "getpeername() failed with errno 36: Operation now in progress". 1.7.3.2 from KOMH doesn't do that.
- 1.7.3.2 issues "POST /psmedley/gcc/git-upload-pack HTTP/1.1" with "Content-Length: 1115" and "Expect: 100-continue" and gets a successful reply with "Content-Type: application/x-git-upload-pack-result".
- 1.7.9.6 issues this POST with "Content-Length: 1123" and fails with "Empty reply from server".
- 2.0.0 issues this POST with "Content-Length: 1140" and fails with "Empty reply from server".
comment:24 by , 11 years ago
Content-Length doesn't seem to be directly related; git 1.8.5.2 on Mac has it set to 1151 and all works. However, on Mac git also says "* upload completely sent off: 1151 out of 1151 bytes" before getting a reply from github. Though this may be a version difference (more logging in 1.8), this gives a hint that for some reason the content of the POST request is not sent to the server in our versions 1.7.9.6 and 2.0.0. I will check this.
Another thing is that AFAIR my own build of the version with KOMH patches behaved just like the current trunk (i.e. didn't work). But it never was 1.7.3.2, it was 1.7.9.6 with KOMH patches from 1.7.3.2 applied. So there are two possibilities: either KOMH versions of SSL and CRYPTO libraries (kssl and crypto) differ from what we have in terms of working with HTTP or there was some change after 1.7.3.2 that made KOMH patches ineffective. This also needs some thinking and checking.
comment:25 by , 11 years ago
Another thing to mention is that the original KOMH build links to both SSL and CRYPTO but our builds only link to CRYPTO. Also, the KOMH build doesn't link to CURL, which means he uses his own static CURL build (which could also contain some private fixes). Finally, the KOMH build doesn't use MMAP (but I don't think it's related to these problems).
comment:26 by , 11 years ago
Yes, KOMH has his own static build of CURL (http://bauxite.sakura.ne.jp/software/os2/misc/curl-7.20.1-os2-20100522.zip) which also depends on LIBSSH2 (http://bauxite.sakura.ne.jp/software/os2/misc/libssh2-1.2.4-os2-20100301.zip). However, quickly linking git-remote-http.exe against it (which dropped CURL7.DLL dependency and dragged in SSL10.DLL) didn't cure the problem. So it seems to be irrelevant.
I guess it has something to do with strange messages like "0x20059d60 is at send pipe head!" and "getpeername() failed with errno 36: Operation now in progress" (which are results of git evolution after 1.7.3.2) then. I will concentrate on them and check how HTTP works.
comment:27 by , 11 years ago
JFTR, linking against KOMH's SSL and CRYPTO doesn't help either. This, again, narrows the problem down to the new post-1.7.3.2 git code.
comment:28 by , 11 years ago
Performed one more test: built git 1.7.3.2 with original KOMH patches, both using KOMH libs and using RPM libs (curl/ssh/crypto). Works, regardless of libs. So it's definitely the change in git after 1.7.3.2 that broke it. Okay, good to know that.
comment:29 by , 11 years ago
Another observation: "0x20059d60 is at send pipe head!" and "getpeername() failed with errno 36: Operation now in progress" are not to blame, as I see them in my 1.7.3.2 build too and it works. It's now clearly seen that the place where things break is the "POST /psmedley/gcc/git-upload-pack HTTP/1.1" request. The old git gets an immediate reply to that, while the new git doesn't seem to get any reply and finally fails with the message "The remote end hung up unexpectedly". The only other visible difference, besides the different Content-Length, is that the old git also sets "Expect: 100-continue". The new git doesn't. But that's unlikely to be the cause since new gits on other platforms don't set this header field at all.
comment:30 by , 11 years ago
Just to make sure we don't miss something important outside, I updated our CURL port from 7.21.1 to the latest 7.37. I got these new "* upload completely sent off: 1140 out of 1140 bytes" messages in the debug output with "set GIT_CURL_VERBOSE=1", but the problem is still the same. Okay, at least we have the newest CURL. I will commit it later, when git is done (the new CURL doesn't actually require a lot of work, just two small patches in lib/url.c, one old from Yuri and one new from me (trivial); moreover, given our new and shiny autoconf toolchain, it eliminates the need for Yuri's other patches and builds right out of the box).
comment:31 by , 11 years ago
Another interesting thing with the new CURL is that "0x20059d60 is at send pipe head!" and "getpeername() failed with errno 36: Operation now in progress" have disappeared. They come from the old CURL then and should be ignored.
comment:32 by , 11 years ago
I compared the RPC packet sent on clone by git 1.7.3.2 (which works) and 2.0.0 (which doesn't). There are only minor differences (the later version of git has a couple of additional fields), the rest is identical. Moreover, I grabbed the working packet and made the trunk version send it instead of its own data, and it still doesn't work. This proves that the problem is not the packet contents but the connection itself. Somehow, newer git versions establish it differently on OS/2 and that confuses the git server.
Note that I also tried to build 2.0.0 without USE_CURL_MULTI (which enables simultaneous HTTP transfers) since this mode was absent in 1.7.3.2 and it downloaded everything in one stream. But this didn't help either. So the problem is not related to the multithreaded transfer mode.
comment:33 by , 11 years ago
Just noticed that git uses fork() to start its commands on OS/2. And that it also uses so-called "stateless-rpc" mode where the child process doesn't open a new HTTP connection but instead pipes to and from the server through the parent (which holds an open connection) via stdin/stdout. Maybe this is where things break as it's known our fork() impl isn't perfect. One doubt is that git 1.7.3.2 also uses fork() and all works. But anyway it's good to get rid of fork in this particular case as it's used too intensively.
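The parent/child arrangement described above can be sketched roughly as follows. This is only an illustration of piping through stdin/stdout, not git's actual code: the child script, the function name relay_through_pipes and the message contents are made up for the example, and Python's subprocess module stands in for the fork()/spawn call. The parent plays the role of the process that would hold the open HTTP connection.

```python
import subprocess
import sys

# Hypothetical stand-in for the child command: it reads everything from stdin
# and answers on stdout, never touching the network itself.
CHILD_CODE = (
    "import sys\n"
    "request = sys.stdin.buffer.read()\n"
    "sys.stdout.buffer.write(b'reply:' + request)\n"
)


def relay_through_pipes(request):
    """Start the child and exchange bytes with it over stdin/stdout pipes."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    # communicate() writes the request, closes stdin and reads stdout to EOF.
    reply, _ = child.communicate(request)
    if child.returncode != 0:
        raise RuntimeError("child exited with code %d" % child.returncode)
    return reply
```

If either end of such a pipe dies or receives corrupted data, the failure shows up on the other end as a write or read error rather than as a network error, which is worth keeping in mind when reading the error messages later in this ticket.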
comment:34 by , 11 years ago
I replaced fork with spawnvpe and it still doesn't work. Wrong path... I will leave the spawnvpe changes in though since they are good to have anyway. Will commit them when I find what's wrong.
comment:35 by , 11 years ago
I'm comparing git 2.0.1 output on Mac with git 2.0.0 on OS/2 with more logging (-v -v -v -v, GIT_TRANSPORT_HELPER_DEBUG=1, GIT_TRANSLOOP_DEBUG=1, GIT_CURL_VERBOSE=1, GIT_TRACE=1, GIT_TRACE_PACKET=1) and the Mac version issues the very same packet (both contents and length), and all the sequence of operations is completely the same and it works (in fact, the trace log output is almost identical). This leads me to a conclusion that the problem may lie in the CURL library itself but it is simply not triggered by earlier git versions because they don't use some features. I will add more debugging to CURL then.
Among the differences is the fact that on Mac a TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 is established while on OS/2 it is an SSL connection using TLSv1.0 / AES128-SHA. I wonder if this can be the source of problems...
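For anyone wanting to reproduce the logging setup from this comment, a minimal sketch of a wrapper script is given below. The environment variable names and the four -v switches are the ones listed above; the helper names (traced_env, clone_command, run_traced_clone) and the choice of "git clone" as the command are assumptions made for the example.

```python
import os
import subprocess

# Debug switches quoted in the comment above; each one is set to "1".
TRACE_VARS = (
    "GIT_TRANSPORT_HELPER_DEBUG",
    "GIT_TRANSLOOP_DEBUG",
    "GIT_CURL_VERBOSE",
    "GIT_TRACE",
    "GIT_TRACE_PACKET",
)


def traced_env(base=None):
    """Return a copy of the environment with all trace switches enabled."""
    env = dict(os.environ if base is None else base)
    for name in TRACE_VARS:
        env[name] = "1"
    return env


def clone_command(url, target="."):
    """Build the argument list; -v is repeated four times as in the comment."""
    return ["git", "clone", "-v", "-v", "-v", "-v", url, target]


def run_traced_clone(url, target="."):
    """Run the clone with tracing on and return git's exit code."""
    return subprocess.call(clone_command(url, target), env=traced_env())
```

Calling run_traced_clone("https://github.com/psmedley/gcc") on both machines and saving the output to files makes it easier to diff the two traces side by side.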
comment:36 by , 11 years ago
I simplified the situation down to the following. I installed a local Apache server to host the GCC git repository locally and see what's going on from the other side (this is not a clean test case though since both github and bitbucket use their own HTTP servers for git, but still). I didn't bother setting HTTPS and used plain HTTP (and due to my simple setup POST is also not used, only GET — which makes things a bit easier to track down). The Mac client works, the OS/2 client fails but with different diagnosis:
Getting pack e9d9ca0c0c74d2fea9e6114e33269ea6db4ec519
 which contains d09a35b6abdfcc5c88ac0ba1543ea5f37a4df113
* Couldn't find host 192.168.1.102 in the .netrc file; using defaults
* Found bundle for host 192.168.1.102: 0x2004a038
* Re-using existing connection! (#3) with host 192.168.1.102
* Connected to 192.168.1.102 (192.168.1.102) port 8888 (#3)
> GET /git/gcc.git/objects/pack/pack-e9d9ca0c0c74d2fea9e6114e33269ea6db4ec519.pack HTTP/1.1
User-Agent: git/2.0.0
Host: 192.168.1.102:8888
Accept: */*
Accept-Encoding: gzip
< HTTP/1.1 200 OK
< Date: Thu, 03 Jul 2014 19:45:10 GMT
* Server Apache/2.2.26 (Unix) mod_fastcgi/2.4.6 mod_wsgi/3.4 Python/2.7.6 PHP/5.5.10 mod_ssl/2.2.26 OpenSSL/0.9.8y DAV/2 mod_perl/2.0.8 Perl/v5.18.2 is not blacklisted
< Server: Apache/2.2.26 (Unix) mod_fastcgi/2.4.6 mod_wsgi/3.4 Python/2.7.6 PHP/5.5.10 mod_ssl/2.2.26 OpenSSL/0.9.8y DAV/2 mod_perl/2.0.8 Perl/v5.18.2
< Last-Modified: Thu, 03 Jul 2014 15:27:58 GMT
< ETag: "109a179-2ab7dd29-4fd4ba74c1b80"
< Accept-Ranges: bytes
< Content-Length: 716692777
< Content-Type: text/plain
<
* Closing connection 3
error: Unable to get pack file http://192.168.1.102:8888/git/gcc.git/objects/pack/pack-e9d9ca0c0c74d2fea9e6114e33269ea6db4ec519.pack
The requested URL returned error: 404 Not Found
error: Unable to find d09a35b6abdfcc5c88ac0ba1543ea5f37a4df113 under http://192.168.1.102:8888/git/gcc.git
Cannot obtain needed object d09a35b6abdfcc5c88ac0ba1543ea5f37a4df113
error: Fetch failed.
Debug: Remote helper quit.
So it somehow can't download the pack file. It sees the file and the server replies with the correct size, but git aborts. It's certainly a problem of git for OS/2 since, again, on Mac this pack file is downloaded successfully. I hope that breaking this down will help me with the POST code path as well.
comment:37 by , 11 years ago
This is somehow related to TCP socket transfer timings. When I added more logging my local server started working: the given .pack file was successfully downloaded by git. I'm trying to find a git server that would serve HTTPS and support the git-upload-pack service that involves the POST method.
comment:38 by , 11 years ago
I tried the GitBlit local server http://gitblit.com/ but apparently it somehow is broken. Neither git on mac nor git on OS/2 can clone via HTTPS from it. Looking for another one...
comment:39 by , 11 years ago
Okay, I simply set up Apache on Mac using MAMP (http://www.mamp.info), then enabled the upload-pack backend using the official git manual (https://www.kernel.org/pub/software/scm/git/docs/git-http-backend.html) and got it to work in this mode with POST. Git on Mac works well in this setup and, guess what, git on OS/2 works well too! It gets a reply from the server after issuing the "POST /psmedley/gcc/git-upload-pack HTTP/1.1" request and all goes fine after that.
This means it's either logging timing again or the problem is related to the external network interface setup (the server is run in my LAN so the traffic doesn't go through the router to the outer world).
comment:40 by , 11 years ago
It actually doesn't work on the local server after rebooting the OS/2 machine until I add full logging again. Definitely, timings. And not related to the type of the interface. Fails with this:
POST git-upload-pack (gzip 2252 to 1140 bytes)
*** [1] remote-curl.c:636:post_rpc:
*** [265:1] D:/Coding/ports/curl/curl-7.37.0/lib/http.c:2507:Curl_http: postfields 0x201ac220 (1140)
*** [265:1] D:/Coding/ports/curl/curl-7.37.0/lib/http.c:2517:Curl_http: postsize 1140
fatal: write error: Bad address
fatal: early EOF
fatal: index-pack failed
I.e. it can't even write the POST packet out.
But even with logging enabled, after successfully receiving the pack file git starts the index-pack command and that also fails on OS/2 with the following output:
trace: run_command: 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 225 on hugaida' '--check-self-contained-and-connected' '--pack_header=2,1317807'
trace: exec: 'git' 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 225 on hugaida' '--check-self-contained-and-connected' '--pack_header=2,1317807'
trace: built-in: git 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 225 on hugaida' '--check-self-contained-and-connected' '--pack_header=2,1317807'
fatal: write error: Bad address
fatal: early EOF
fatal: index-pack failed
Debug: Remote helper quit.
comment:41 by , 11 years ago
While testing things I see a lot of code in CURL that replaces poll() with select() (the same as with , maybe the problem is in this replacement, I will check that. There is also some strange pipe() usage. Also needs to be checked.
comment:42 by , 11 years ago
Making my way through CURL, it's rather complex I must say. Pipe is not involved in our problem, and the poll implementation seems to be correct. I now suspect OpenSSL as it's that library which does the actual data sending in case of HTTP(S) in CURL (ossl_send). It must be something related to the size of the buffer to send, as it's known to work on small repos and small pushes.
comment:43 by , 11 years ago
More evidence of some problem with the packet size: I changed MAX_INITIAL_POST_SIZE to a very low value of 1024 (originally it is 64k) and now the POST request worked fine. This value limits the size of the POST request when attaching it to the headers block for sending in a single packet. If the POST buffer exceeds this value, it is sent in separate chunks. So, somehow attaching it to the headers screws up the OS/2 TCP/IP stack (or OpenSSL, this is to be sorted out).
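The effect of that limit can be illustrated with a small sketch. It models only the rule as described in this comment (body attached to the headers when it fits under the limit, sent separately otherwise); the function name plan_post and the header bytes are made up, the real code sends a large body in more than one chunk, and the exact comparison used by the library at the boundary is an assumption.

```python
# Original value according to the comment; the experiment lowered it to 1024.
MAX_INITIAL_POST_SIZE = 64 * 1024


def plan_post(headers, body, limit=MAX_INITIAL_POST_SIZE):
    """Return the buffers that would be handed to the send routine, in order."""
    if len(body) > limit:
        # Body exceeds the limit: headers go out first, the body follows.
        return [headers, body]
    # Body fits: it is appended to the headers and sent as one buffer.
    return [headers + body]


# The failing POST body in this ticket was 1140 bytes long.
headers = b"POST /psmedley/gcc/git-upload-pack HTTP/1.1\r\n\r\n"
body = b"x" * 1140
single = plan_post(headers, body)             # default 64k limit
split = plan_post(headers, body, limit=1024)  # limit from the experiment
```

With the default limit the 1140-byte body travels in the same buffer as the headers, and with the limit lowered to 1024 it is sent on its own, which is the only thing the experiment changed.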
This, however, doesn't mean that it fully worked. It failed later, during write, with "Bad address", just the same way it fails when I clone the same repo from the local server. I have to figure out what that is.
comment:44 by , 11 years ago
The "Bad address" error comes from the plain LIBC write attempting to write to a file descriptor which is most likely the child end of the pipe created by the parent process. I'm trying to track down what this pipe is used for.
From what I see in LIBC code, EFAULT is usually returned when the input parameters are wrong. I can't see any code returning EFAULT that could be related to write except the __libsocket_safe_copy intended to make sockets work with high memory, but that doesn't seem to be our case since recompiling git in low memory mode gives the same thing.
comment:45 by , 11 years ago
Cc: | added |
---|
comment:46 by , 11 years ago
Why are you referencing komh for the bauxite port? komh has his own page with a Korean tld and the ports at the bauxite site are by some Japanese users.
comment:47 by , 11 years ago
Cc: | added |
---|
comment:48 by , 11 years ago
Steven, I have no idea why I refer to that build as KOMH's; really. It's been like that for a long time, but now I see that it's a build by Bauxite (with Sava patches). Some mixup happened back then.
JFTR, when trying to push with my POST hack applied to CURL I get this (i.e. it doesn't work):
* SSL read: error:00000000:lib(0):func(0):reason(0), errno 60
* Closing connection 2
* The cache now contains 0 members
error: RPC failed; result=56, HTTP code = 0
fatal: The remote end hung up unexpectedly
fatal: The remote end hung up unexpectedly
Debug: Remote helper: <- Everything up-to-date
Debug: Disconnecting.
comment:49 by , 11 years ago
Okay, "Bad address" in write() comes from the crash that happens in git.exe called with the index-pack built-in command. It crashes in a rather strange way, with the XCPT_PROCESS_TERMINATE exception. Maybe it's related to fork.
comment:50 by , 11 years ago
I added EXCEPTQ support to git to get the call stack trace upon the crash but it seems that adding EXCEPTQ makes the code crash before the original crash I'm looking for. At least I get this in the .TRP file:
______________________________________________________________________
 Exception Report - created 2014/07/15 05:53:07
______________________________________________________________________
 OS2/eCS Version: 2.45
 # of Processors: 4
 Physical Memory: 3327 mb
 Virt Addr Limit: 2560 mb
 Exceptq Version: 7.10 (Mar 1 2011)
______________________________________________________________________
 Exception C0000005 - Access Violation
______________________________________________________________________
 Process: D:\CODING\GIT\DMIK-RUN\LIBEXEC\GIT-CORE\GIT.EXE
 PID: 4164 (16740)
 TID: 02 (2)
 Priority: 200
 Filename: C:\USR\LIB\GCC473.DLL
 Address: 005B:1DDC00E3 (0001:000000E3)
 Cause: Attempted to write to 0288FF38 (uncommitted memory allocated by EXCEPTQ)
______________________________________________________________________
 Failing Instruction
______________________________________________________________________
 1DDC00D2 CMP EAX, 0x1000 (3d 00100000)
 1DDC00D7 LEA ECX, [ESP+0xc] (8d4c24 0c)
 1DDC00DB JB 0x1ddc00f2 (72 15)
 1DDC00DD SUB ECX, 0x1000 (81e9 00100000)
 1DDC00E3 >OR DWORD [ECX], 0x0 (8309 00)
 1DDC00E6 SUB EAX, 0x1000 (2d 00100000)
 1DDC00EB CMP EAX, 0x1000 (3d 00100000)
 1DDC00F0 JA 0x1ddc00dd (77 eb)
______________________________________________________________________
And this is on the screen:
Creating 4164_02.TRP
Killed by SIGSEGV
pid=0x4164 ppid=0x4163 tid=0x0002 slot=0x00a4 pri=0x0200 mc=0x0001 ps=0x0010
D:\CODING\GIT\DMIK-RUN\LIBEXEC\GIT-CORE\GIT.EXE
GCC473 0:000000e3
cs:eip=005b:1ddc00e3 ss:esp=0053:0289ff2c ebp=0289fff4
ds=0053 es=0053 fs=150b gs=0000 efl=00010202
eax=0000106c ebx=00000007 ecx=0288ff38 edx=002df90c edi=002df90c esi=000a6544
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
comment:51 by , 11 years ago
Seems that the above problem is caused by EXCEPTQ itself. I disabled it and could run the clone a bit longer this time (around 6 MB of repo data out of several hundred was downloaded) which ended up with this:
fatal: protocol error: bad line length character: шнГА
error: inflate: data stream error (incorrect data check)
fatal: pack has bad object at offset 8568233: inflate returned -3
fatal: index-pack failed
Debug: Remote helper quit.
шнГA is this sequence of bytes: {E8 AD 83 41}, which is definitely not a line length. Perhaps the XCPT_PROCESS_TERMINATE crash I sometimes see is caused by a different example of garbage. This garbage is clear evidence of the fact that the received packet gets corrupted somewhere. The pipeline it goes through looks as follows:
TCP/IP stack -> OpenSSL's recv -> CURL read -> Git
I did many checks of Git and CURL and still didn't find the failure which indirectly suggests that it's either OpenSSL or TCP/IP.
I will try to see what's on the wire with iptrace and what gets delivered to Git. I don't know if it will give me any hint though.
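To see why git rejects those four bytes: in git's pkt-line framing every packet starts with a length written as four ASCII hex digits, and the length counts the four prefix bytes themselves. The sketch below checks a prefix the same way in spirit; the function name parse_pkt_len is made up, and decoding the garbage with code page 866 is only an assumption about how the console rendered it.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_pkt_len(prefix):
    """Return the length encoded in a 4-byte pkt-line prefix, or None if invalid."""
    if len(prefix) != 4:
        return None
    try:
        text = prefix.decode("ascii")
    except UnicodeDecodeError:
        # Bytes above 0x7F can never be hex digits.
        return None
    if any(ch not in HEX_DIGITS for ch in text):
        return None
    return int(text, 16)


good = parse_pkt_len(b"0032")             # a well-formed prefix
garbage = bytes([0xE8, 0xAD, 0x83, 0x41])  # the bytes from the error message
bad = parse_pkt_len(garbage)
shown = garbage.decode("cp866")            # assumed console code page
```

Here good is 50, bad is None, and shown is the same four characters that appear in the error message, so the reader was looking at arbitrary stream data at a position where a length prefix was expected.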
comment:52 by , 11 years ago
Exceptq trap is most likely because you are using an antique exceptq on an SMP box. FWIW, the trap is in _alloca().
comment:53 by , 11 years ago
Well, IPTRACE is actually of no use as all interesting packets are encrypted by OpenSSL. Doesn't give me anything. I see a big packet that looks like the POST request sent by git (judging by the destination and size) but nothing comes from the github server in reply. Not a single packet.
Steven, yes, it's written in the log file that it's version 7.10. I really wonder why the new install has disappeared. That's perhaps because of the new RPM package: I deleted all custom EXCEPTQ DLLs (except the default eCS one) and only installed exceptq-devel afterwards, which doesn't drag except in.
Anyway, with the right DLL (7.11, 3 Mar 2014) it still crashes at the same place as version 7.10 but the report is a bit different:
______________________________________________________________________
 Exception Report - created 2014/07/17 03:54:29
______________________________________________________________________
 OS2/eCS Version: 2.45
 # of Processors: 4
 Physical Memory: 3327 mb
 Virt Addr Limit: 2560 mb
 Exceptq Version: 7.11-shl (Mar 3 2014)
______________________________________________________________________
 Exception C0000005 - Access Violation
______________________________________________________________________
 Process: D:\CODING\GIT\DMIK-RUN\LIBEXEC\GIT-CORE\GIT.EXE
 PID: 4E (78)
 TID: 02 (2)
 Priority: 200
 Filename: C:\USR\LIB\GCC473.DLL 02/05/2014 03:17:06 26,987
 Address: 005B:1DDC00E3 (0001:000000E3)
 Cause: Attempted to write to 0288FF38 (uncommitted memory allocated by EXCEPTQ)
______________________________________________________________________
 Failing Instruction
______________________________________________________________________
 1DDC00D2 CMP EAX, 0x1000 (3d 00100000)
 1DDC00D7 LEA ECX, [ESP+0xc] (8d4c24 0c)
 1DDC00DB JB 0x1ddc00f2 (72 15)
 1DDC00DD SUB ECX, 0x1000 (81e9 00100000)
 1DDC00E3 >OR DWORD [ECX], 0x0 (8309 00)
 1DDC00E6 SUB EAX, 0x1000 (2d 00100000)
 1DDC00EB CMP EAX, 0x1000 (3d 00100000)
 1DDC00F0 JA 0x1ddc00dd (77 eb)
comment:54 by , 11 years ago
OK. Did you post the entire .trp file or just the header? I'm going to need to see a process dump so that I can figure out what is calling alloca. Exceptq does not use alloca explicitly. If you don't have it, grab the latest version of pdumpctl from my os2diags page and set it up for a "normal" dump and let me know where to grab it from.
One thing worth trying is unloading exceptq and moving distorm.dll out of LIBPATH and seeing what effect, if any, this has on the trap.
I did not expect the iptrace to be readable. What I would be looking for is the packet deltas and whether there were any retries. This is mostly to confirm what the git/curl debug output reported.
comment:55 by , 11 years ago
Cc: | removed |
---|
comment:56 by , 11 years ago
Cc: | added |
---|
comment:57 by , 10 years ago
Steve, yes, that was only the header. The full version of the .TRP and the IPTRACE dump is here: http://rpm.netlabs.org/test/git-test-steve-1.zip.
I will try to remove distorm.dll when I try this next time.
comment:58 by , 10 years ago
Steve, here http://rpm.netlabs.org/test/git-test-steve-2.zip is the .TRP file with distorm.dll disabled. It differs only in that it complains that no DISTORM or DIS386 is found.
comment:59 by , 10 years ago
The full .trp file makes it obvious that we have a stack overflow.
Stack Info for Thread 02
00010000 028A0000 -> 0289FF2C -> 02890000 -> 02890000
Alloca() is touching the stack to ensure that the pages are committed so that the stack content can be accessed in any order.
ECX : 0288FF38 is outside the stack and EAX : 0000106C tells us how much stack was left to be touched when the exception triggered.
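The numbers above can be checked by hand. Here is a minimal sketch in C that only redoes the arithmetic, using the addresses quoted from the trap report. The constant and function names are made up for illustration, and reading 028A0000 and 02890000 as the top and bottom of the thread's stack is my interpretation of the Stack Info line.

```c
#include <stdint.h>

/* Values copied from the trap report for thread 02. */
static const uint32_t STACK_TOP    = 0x028A0000u; /* assumed: highest stack address */
static const uint32_t STACK_BOTTOM = 0x02890000u; /* assumed: lowest stack address  */
static const uint32_t FAULT_ADDR   = 0x0288FF38u; /* ECX at the failing OR          */
static const uint32_t BYTES_LEFT   = 0x0000106Cu; /* EAX at the failing OR          */

/* Total size of the thread's stack in bytes. */
static uint32_t stack_size(void)
{
    return STACK_TOP - STACK_BOTTOM;
}

/* How far below the stack bottom the probe write landed. */
static uint32_t overshoot(void)
{
    return STACK_BOTTOM - FAULT_ADDR;
}

/* Bytes alloca() still had to probe when the trap happened. */
static uint32_t bytes_left(void)
{
    return BYTES_LEFT;
}
```

So the stack is 0x10000 bytes (64K), the faulting write landed 0xC8 (200) bytes below its bottom, and a bit more than one 4K page (0x106C = 4204 bytes) was still waiting to be touched.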
64K is a smallish stack for apps ported from *ix. However, it is the default used by pthreads. I recommend you tweak the pthread_create calls to allocate something like a 512KB stack. git does not appear to create a lot of threads, so this should not cause memory availability issues.
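For reference, a minimal sketch of what such a tweak could look like with the standard pthread attribute calls. The wrapper name `create_thread_with_stack`, the sample thread function and the 512KB constant are illustrative assumptions, not code from git or from the OS/2 pthread port.

```c
#include <pthread.h>
#include <stddef.h>

/* Assumed value, taken from the 512KB suggestion above. */
#define WORKER_STACK_SIZE (512 * 1024)

/* Hypothetical wrapper: start a thread with an explicit stack size.
 * Returns 0 on success or the error code of the failing pthread call. */
static int create_thread_with_stack(pthread_t *thread, size_t stack_size,
                                    void *(*start)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, stack_size);
    if (rc == 0)
        rc = pthread_create(thread, &attr, start, arg);

    pthread_attr_destroy(&attr);
    return rc;
}

/* Sample thread body: touches a 256KB local buffer page by page,
 * which would not fit into a 64K stack. */
static void *touch_big_buffer(void *arg)
{
    volatile char buf[256 * 1024];
    for (size_t i = 0; i < sizeof buf; i += 4096)
        buf[i] = 0;
    return arg;
}
```

A caller would pass `WORKER_STACK_SIZE` and `touch_big_buffer` to the wrapper and then `pthread_join` the thread as usual. Changing the default inside the pthread library instead would have the same effect without touching every call site.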
follow-up: 62 comment:60 by , 10 years ago
I have updated the OpenSSL port to version 1.0.0n (the previous 1.0.0a was 4 years old) and according to my tests the problems with the POST packet have gone — the server sends the answer and git goes further. However, there is still a problem with crashing the second child git application — the reason is not yet known.
BTW, OpenSSL 1.0.1i (which is now our trunk) is also merged in but it has some problems with linking the DLL (they use some tools like ar that are currently unreliable on OS/2). I will make it build later. I created #30 for it.
Steven, good guess about the stack condition (I overlooked that in .TRP). This must be the case as I temporarily disabled the code that increased the default stack size during some of my tests. I've now re-enabled it and we will see if the EXCEPTQ problem goes away.
by , 10 years ago
Attachment: | iptrace-fmt-20140809-1036.zip added |
---|
Annotated ipformat output of iptrace.dmp for git-test-steve-1.zip
comment:61 by , 10 years ago
Following up on the iptrace.dmp from git-test-steve-1.zip...
It appears that a fragment reassembly error occurred. This is what packet# 126 ICMP is reporting.
It's possible that packet# 376 is related, but there's not enough trace to say for sure and IP 192.030.252.129 does not seem to be related to the transmissions that did complete without error.
If the problem persists, one could experiment with the fragttl setting and see if this has any effect.
It also might be useful to use wireshark to decode ICMP data. You can use ipformat -x to convert iptrace.dmp to something that wireshark understands.
comment:62 by , 10 years ago
Replying to dmik:
Steven, good guess about the stack condition (I overlooked that in .TRP). This must be the case as I temporarily disabled the code that increased the default stack size during some of my tests. I've now re-enabled it and we will see if the EXCEPTQ problem goes away.
In hindsight, I don't really think we had an exceptq problem per se.
More likely, loading exceptq used enough additional stack to trigger the overflow.
comment:63 by , 10 years ago
The EXCEPTQ problem has gone indeed. So it was the lack of stack space for it to handle some exception. The strange thing is that I don't get any crashes now at all. The (3rd) child process just terminates w/o crashing; I need to find out why. BTW, the process chain when cloning from HTTP is as follows (JFTR):
1. git.exe clone <URL>
2. git-remote-https.exe origin <URL>
3. git.exe fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress <URL>
4. git.exe index-pack --stdin --fix-thin --keep=fetch-pack <PID_3> on <HOST> --check-self-contained-and-connected --pack-header=X,Y
Process 2 is the one that creates the HTTPS connection through Curl which opens a socket via OpenSSL. This socket is then used to read data from the remote server. This data is then flushed to Process 3 via a pipe connected to its stdin. Process 3 in turn sends some data further to Process 4 via a pipe connected to its stdin as well.
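To make the plumbing concrete, here is a minimal POSIX sketch of one link of such a chain: a parent that writes data into a pipe connected to a child's stdin. The helper name `feed_child` is hypothetical and this is not how git itself spawns its helpers (git uses its own run_command machinery); it only illustrates the pipe mechanics.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child whose stdin is the read end of a pipe,
 * write `len` bytes to it and return the number of bytes the child read
 * (reported through its exit status, so only meaningful below 256).
 * Returns -1 on error. */
static int feed_child(const char *data, size_t len)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: make the pipe its stdin, then read until EOF. */
        char buf[64];
        ssize_t n;
        int total = 0;

        close(fds[1]);
        if (fds[0] != STDIN_FILENO) {
            dup2(fds[0], STDIN_FILENO);
            close(fds[0]);
        }
        while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
            total += (int)n;
        _exit(total & 0xFF);
    }

    /* Parent: push the data down the pipe, then signal EOF by closing it. */
    close(fds[0]);
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fds[1], data + off, len - off);
        if (n < 0)
            break; /* the write fails if the child has gone away */
        off += (size_t)n;
    }
    close(fds[1]);

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

If the child exits before the parent has finished writing, the parent's write fails, which is the general situation in which the writing side of such a chain starts reporting errors.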
This run happens to download about 10MB and then it terminates with the following error coming from Process 2:
D:/Coding/ports/curl/curl-7.37.0/lib/select.c:488:Curl_poll: errno 14 (Bad address)
This usually happens when the child process closes its end of the pipe (e.g. by terminating). It's not yet clear to me why this happens. Quite a complex execution chain, I need to add more debugging.
I also rechecked the push test with the new OpenSSL and with iptrace enabled. And, surprisingly, after a number of failed attempts it worked! I'm not yet sure if it's OpenSSL that cured it or not because I found out that my ISP is having heavy connectivity problems at home and my internet drops off very often. Other failed attempts could be due to the internet connection loss as well. But when I switched to Mobile, it worked right away. I now think that this connection problem may somehow be a reason for the clone failure. I can't tell for sure because I can't clone that much through my mobile's 3G network. I hope that the ISP will fix the problem tomorrow and I will retry.
comment:64 by , 10 years ago
Steven, thanks for annotating the IPTRACE dump and for your hints. I will try fragttl once I restore my broadband connection. Some comments regarding what git did there. It opened a connection and exchanged some system information with the server (all succeeded). Then it tried the POST request but didn't receive any answer. It decided there was a network problem and reconnected (to the new IP since github responds on several IPs), just to send this POST request again. The packet whose size is 1490 bytes is this packet. What puzzles me is why there are many packets of this size in the trace. I'm sure that the software generates it only two times (the first attempt and the second one). Does that mean that Curl/OpenSSL is just trying to send it over and over again for some reason? I see that after some of these packets the server sends a reply but I can't understand then why they are resent (yet).
And what do you mean by saying that packet #376 is late?
comment:65 by , 10 years ago
Okay, after resolving my ISP's connectivity issues I tested clone again, and this indeed turned out to be the lack of stack space! It was in the air. There is already r354 where Yuri attempted to solve a similar problem by increasing the stack size from 64K (the default for his pthread_create implementation) to 640K, but that's apparently not enough for some cases. After increasing the stack to 640M (just to be sure that's enough :) I could *almost* successfully clone Paul's giant GCC repo from the link at the top of the ticket (740 MB in size).
It failed at the very end, however, in mmap():
Debug: Remote helper: <- lock D:/Coding/1/.git/objects/pack/pack-591156e16349f58059bd987225c3c7237e78b351.keep
Debug: Remote helper: Waiting...
Debug: Remote helper: <- connectivity-ok
Debug: Remote helper: Waiting...
Debug: Remote helper: <-
trace: run_command: 'rev-list' '--objects' '--stdin' '--not' '--all'
trace: exec: 'git' 'rev-list' '--objects' '--stdin' '--not' '--all'
*** [61507:1] git.c:551:main: xcpt_rec 0x2dff48
*** [61507:1] compat/os2.c:879:git_os2_main_prepare: xcpt_rec 0x2dff48
trace: built-in: git 'rev-list' '--objects' '--stdin' '--not' '--all'
fatal: Out of memory? mmap failed: Invalid argument
Unexpected end of command stream
The big difference is that the index (the contents of .git) was left intact by git after this failure, which most likely means that all repo data had been successfully downloaded from the server and that this failure was related not to the network operations but to the procedure of checking out a working copy from the index (where mmap failed). I will look at mmap tomorrow. This is very good progress, anyway.
Note that the man page says that "On Linux/x86-32, the default stack size for a new thread is 2 megabytes", I guess we should do the same in the OS/2 implementation of pthread_create to be more compatible with the *nix world. Since the original git code doesn't increase the default stack size, this should be enough for normal git usage. I will talk to Yuri about that.
comment:66 by , 10 years ago
Ok, mmap() fails because git sets its offset argument to a non-zero value, which is not supported by our MMAP implementation. This is done when walking through the index pack file (use_pack() in sha1_file.c).
Note that the git builds from Bauxite didn't have this problem because they were built with NO_MMAP. In this case git uses a surrogate that simply allocates a buffer in memory and reads the file contents into that buffer. And this surrogate supports the offset argument well.
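The idea behind such a surrogate can be sketched in a few lines of C. This is an illustration under my own assumptions, not git's actual compat code: the function name `read_into_buffer` is hypothetical and error handling is kept to a minimum.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a read-only private mmap(): copy `length` bytes
 * of the file, starting at `offset`, into a freshly allocated heap buffer.
 * The caller releases the buffer with free(). Returns NULL on failure. */
static void *read_into_buffer(int fd, size_t length, off_t offset)
{
    char *buf = malloc(length ? length : 1);
    size_t done = 0;

    if (buf == NULL)
        return NULL;

    while (done < length) {
        ssize_t n = pread(fd, buf + done, length - done, offset + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted, just retry */
            free(buf);
            return NULL;
        }
        if (n == 0) {
            /* File is shorter than requested: zero-fill the rest. */
            memset(buf + done, 0, length - done);
            break;
        }
        done += (size_t)n;
    }
    return buf;
}
```

Because the offset is simply handed to pread(), any offset works here and no page alignment is involved. The price is that the whole range is copied into memory up front.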
I will check how difficult it is to support offset in our MMAP and then we will decide. This argument is just an offset within a file to map that defines where to start mapping the file contents from. In theory this should be easy to implement. It would be nice to have this feature in since I guess there are other apps using this that we want or may want to port.
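For illustration, here is a small POSIX sketch of how a caller typically uses a non-zero offset. On POSIX systems the offset passed to mmap() must be a multiple of the page size, so the sketch rounds it down and remembers the slack. The `file_window` structure and the `map_window`/`unmap_window` names are hypothetical; this is neither git's code nor our MMAP implementation.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one mapped window of a file. */
struct file_window {
    void       *base;    /* address returned by mmap() (page-aligned)   */
    size_t      map_len; /* length to pass to munmap()                  */
    const char *data;    /* first byte of the range the caller asked for */
};

/* Map `length` bytes of `fd` starting at byte `offset` (read-only).
 * Returns 0 on success, -1 on failure. */
static int map_window(int fd, off_t offset, size_t length,
                      struct file_window *w)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || offset < 0 || length == 0)
        return -1;

    /* Round the offset down to a page boundary; keep the difference. */
    off_t  aligned = offset - (offset % page);
    size_t slack   = (size_t)(offset - aligned);

    void *base = mmap(NULL, length + slack, PROT_READ, MAP_PRIVATE,
                      fd, aligned);
    if (base == MAP_FAILED)
        return -1;

    w->base    = base;
    w->map_len = length + slack;
    w->data    = (const char *)base + slack;
    return 0;
}

/* Release a window obtained from map_window(). */
static void unmap_window(struct file_window *w)
{
    munmap(w->base, w->map_len);
}
```

An implementation that only accepts offset 0 forces every caller to map the file from its beginning, which is not an option for pack files of several hundred megabytes in a limited address space.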
comment:67 by , 10 years ago
I've implemented the offset thingy in MMAP. The test case from here http://man7.org/linux/man-pages/man2/mmap.2.html works fine and I'm now trying to clone the Mozilla/GCC repos to see if it solves the problem.
This improvement required changing both WPSTK and MMAP. We have both on SVN here but Yuri actually uses his private build for MMAP; the one that is on SVN is somewhat different. I will have to commit Yuri's changes to WPSTK (plus my own) to SVN to make it consistent and easy to maintain. However, this will take some time as the WPSTK repo is itself a bit messy (an import from SVN etc).
comment:68 by , 10 years ago
The Mozilla repo has been successfully checked out! Now trying GCC (that will take longer).
comment:69 by , 10 years ago
I congratulate everyone — the GCC repository has been successfully cloned too.
Now I have to commit all patches to all libraries that I made while making this beast work and then I will close this defect.
comment:70 by , 10 years ago
Just to give a short progress update on that:
We are now creating all the RPMs needed to have git installed by rpm.
Those are:
m4 (done)
automake (done)
autoconf (done)
libtool (done)
curl (done)
expat (done)
openssl (done)
git
comment:71 by , 10 years ago
Yes, there have been quite a lot of package updates and quite a lot of work: besides fixing various OS/2 specific glitches, all packages got their very recent versions. Many packages got their first RPM versions on OS/2 (automake, libtool, expat). There was also a fix for our mmap emulation that is now used in git (also in RPM).
comment:72 by , 10 years ago
I finally rebuilt and retested everything with the latest changes and fixes (still 2.0.0 due to #37). I also removed a bit of sava's code that isn't really needed (for the sake of clarity and ease of maintenance, see r861 and r862).
I did some local tests on the resulting build (clone, commit, push, pull) and all seems to work for me. I'm going to release an RPM now.
comment:73 by , 10 years ago
comment:74 by , 10 years ago
Summary: | git: Apply patches by komh → git: Apply patches by bauxite |
---|
The RPM is out. You may now install it with yum install git. The current saga is over, I'm closing this ticket (finally).
Note that although there are several other RPM packages besides git itself (e.g. git-svn) they can't be installed at the moment due to missing perl dependencies. In particular, git-svn needs the SVN::Core perl module which is part of the subversion distribution. Our svn RPM is a package for the binary build of subversion from Paul Smedley so it lacks the perl stuff (as well as Paul's zips). Anyway, this is a separate task unrelated to git itself.
comment:75 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
BTW, I've also renamed the 'komh' branch to 'bauxite' — to match the reality.
I have created a branch (branches/komh) to apply his patches dated 20111002 (latest build from the web page).
I have also applied the patches but for some reason large repositories (like https://github.com/dmik/qt-creator-os2.git) still can't be cloned. Here is what I get:
This is exactly the same as I get with the current SVN build of trunk (i.e. w/o the komh patches). This gives us a hint that it may be a regression of updating our git to 1.7.6.1 (remember that the original version released by komh is 1.7.3.2).