Opened 11 years ago

Closed 10 years ago

#19 closed defect (fixed)

git: Apply patches by bauxite

Reported by: dmik Owned by:
Priority: major Milestone:
Component: *none Version:
Severity: Keywords:
Cc: dryeo, steve53@…

Description

komh maintains his own build of git here: http://bauxite.sakura.ne.jp/software/os2/. His patches contain some important fixes: in particular, they fix cloning of large repositories (where our RPM git fails) and the "out of memory" error when doing gc/repack.

Note that the komh patches already include the sava patches that our SVN also contains, but komh uses the git 1.7.3.2 source base.

Attachments (1)

iptrace-fmt-20140809-1036.zip (10.2 KB) - added by Steven Levine 10 years ago.
Annotated ipformat output of iptrace.dmp for git-test-steve-1.zip

Change History (76)

comment:1 Changed 11 years ago by dmik

I have created a branch (branches/komh) to apply his patches dated 20111002 (latest build from the web page).

I have also applied the patches but for some reason large repositories (like https://github.com/dmik/qt-creator-os2.git) still can't be cloned. Here is what I get:

D:>git clone https://dmik@github.com/dmik/qt-creator-os2.git .
warning: $GIT_FIND is not defined. Assume C:/usr/bin/find.
warning: $GIT_SORT is not defined. Assume C:/usr/bin/sort.
Cloning into ....
Password:
remote: Counting objects: 232323, done.
remote: Compressing objects: 100% (37515/37515), done.
error: RPC failed; result=56, HTTP code = 200MiB | 212 KiB/s
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
warning: https unexpectedly said: '0000'

This is exactly the same as what I get with the current SVN build of trunk (i.e. w/o the komh patches). This gives us a hint that it may be a regression caused by updating our git to 1.7.6.1 (remember that the original version released by komh is 1.7.3.2).

comment:2 Changed 11 years ago by dmik

The good thing is that komh patches fix the problem with starting scripts from libexec/git-core which exists on the trunk.

comment:3 Changed 11 years ago by dmik

Patches are committed to the branch in r609 (with a small fix in r610).

comment:4 Changed 11 years ago by dmik

Note that solutions like this one (http://stackoverflow.com/questions/6842687/the-remote-end-hung-up-unexpectedly-while-git-cloning) don't actually help. I don't know if it is related at all. This needs deeper debugging.

comment:5 Changed 10 years ago by dmik

Summary: git: Apply patches by komh → Prioritize SHELL over EMXSHELL
Type: task → defect

I'm now trying to finish my git work in the meanwhile (while there is a delay with making WLINK work with the recent Firefox).

comment:6 Changed 10 years ago by dmik

Summary: Prioritize SHELL over EMXSHELL → git: Apply patches by komh на Prioritize SHELL over EMXSHELL
Type: defect → task

comment:7 Changed 10 years ago by dmik

Summary: git: Apply patches by komh на Prioritize SHELL over EMXSHELL → git: Apply patches by komh
Type: task → defect

AAAAAAAAAAAAAA. Safari on Mac has gone completely out of its mind. It substitutes the same values from the last ticket I posted to (http://trac.netlabs.org/libc/ticket/287#comment:5).

comment:8 Changed 10 years ago by dmik

Type: defect → task

comment:9 Changed 10 years ago by dmik

Type: task → defect

I realize that http-push is absent from both my builds and the RPM build by Yuri (but present in the original KOMH builds). It seems to depend on CURL 7.9.8 or above (we have 7.21.1) and EXPAT. But we don't have libexpat (at least not in RPM), so git-http-push.exe is not built. This explains why our "stock" RPM build couldn't push over HTTP. I don't get how my local build can do that, though, since I don't have git-http-push.exe either... Maybe I hacked it somehow, I don't actually remember. This has always been an unfinished hack...

Anyway, this doesn't explain why push over HTTP works but also fails with various RPC errors and unexpected messages in the original KOMH build. But it may point at the area of searching at least.

And the HTTP push problem is what I suffer from most now. Only 2 of my 10 commits work (with either of the builds I have). I often have to use my Mac machine in order to commit Mozilla changes. This somehow seems to depend on the size of the commit object. Very small commits usually work very well. But if it's bigger than some size, then no way. I can't recall the exact error message; I will paste it next time I see it.

The HTTP clone problem may be somehow related to the HTTP push problem. But the fact is that the original build from KOMH (git 1.7.3.2) is free from that problem... As I already guessed, this may be a regression of some sort. Maybe updating to 1.8/1.9/2.0 will help, we will see. I need to collect more details on all that.

This is all a bit messed up in my head ATM. Too many things are on the plate. Will continue sorting them out.

comment:10 Changed 10 years ago by dmik

Type: defect → task

NO, IT'S NOT A DEFECT, IT'S A TASK! SAFARI, WHOEVER, STOP!!

comment:11 Changed 10 years ago by dmik

Type: task → defect

Btw, this is what I usually get now here at netlabs:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/trac/web/api.py", line 514, in send_error
    data, 'text/html')
  File "/usr/local/lib/python2.7/site-packages/trac/web/chrome.py", line 968, in render_template
    message = Markup(req.session.pop('chrome.%s.%d'
  File "/usr/local/lib/python2.7/site-packages/trac/web/api.py", line 316, in __getattr__
    value = self.callbacks[name](self)
  File "/usr/local/lib/python2.7/site-packages/trac/web/main.py", line 268, in _get_session
    return Session(self.env, req)
  File "/usr/local/lib/python2.7/site-packages/trac/web/session.py", line 200, in __init__
    if req.authname == 'anonymous':
  File "/usr/local/lib/python2.7/site-packages/trac/web/api.py", line 316, in __getattr__
    value = self.callbacks[name](self)
  File "/usr/local/lib/python2.7/site-packages/trac/web/main.py", line 135, in authenticate
    authname = authenticator.authenticate(req)
  File "/usr/local/lib/python2.7/site-packages/trac/web/auth.py", line 91, in authenticate
    req.incookie['trac_auth'])
  File "/usr/local/lib/python2.7/site-packages/trac/web/auth.py", line 238, in _get_name_for_cookie
    name = self._cookie_to_name(req, cookie)
  File "/usr/local/lib/python2.7/site-packages/trac/web/auth.py", line 234, in _cookie_to_name
    for name, in self.env.db_query(sql, args):
  File "/usr/local/lib/python2.7/site-packages/trac/db/api.py", line 122, in execute
    return db.execute(query, params)
  File "/usr/local/lib/python2.7/site-packages/trac/db/util.py", line 121, in execute
    cursor.execute(query, params)
  File "/usr/local/lib/python2.7/site-packages/trac/db/util.py", line 65, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/local/lib/python2.7/site-packages/trac/db/sqlite_backend.py", line 78, in execute
    result = PyFormatCursor.execute(self, *args)
  File "/usr/local/lib/python2.7/site-packages/trac/db/sqlite_backend.py", line 56, in execute
    args or [])
  File "/usr/local/lib/python2.7/site-packages/trac/db/sqlite_backend.py", line 48, in _rollback_on_error
    return function(self, *args, **kwargs)
OperationalError: database is locked

comment:12 Changed 10 years ago by dmik

Back to git, the sad thing is that it doesn't allow building out of the source tree. There is a configure script that is supposed to support this, but apparently there are mistakes in the Makefiles (places that don't take this scenario into account).

comment:13 Changed 10 years ago by dmik

Type: defect → task

comment:14 Changed 10 years ago by dmik

This is completely disgusting and mind blowing. I don't understand how such a nice and (relatively) small tool may have such creepy housekeeping. Absurd. I will have to feel pain as I don't want to clean out this dirt ATM (no time and I'm really tired of being a dirt cleaner in every house).

Last edited 10 years ago by dmik

comment:15 Changed 10 years ago by dmik

Type: task → defect

Meanwhile, I found that the latest expat version (2.1.0) already includes some OS/2 bits: http://expat.cvs.sourceforge.net/viewvc/expat/expat/watcom/?hideattic=0&pathrev=R_2_1_0. It's intended for Watcom but I think it shouldn't be difficult to build it with GCC: DOS-related changes are marked with WATCOM (and perhaps the ones marked with WIN32 deserve attention too).

The ports ticket: #23.

comment:16 Changed 10 years ago by dmik

I finally built git 1.7.9.6 from this SVN's trunk (i.e. no KOMH patches) using the RPM env and our own expat. The two major problems are still here:

  1. It can't clone big repos like https://github.com/psmedley/gcc. Fails with:
    Cloning into '.'...
    error: RPC failed; result=52, HTTP code = 0
    fatal: The remote end hung up unexpectedly
    
  2. It can't push big commits (any repo). Fails with:
    error: RPC failed; result=52, HTTP code = 0
    fatal: The remote end hung up unexpectedly
    fatal: The remote end hung up unexpectedly
    

With the build from the KOMH branch, problem 1 disappears, problem 2 is still there.

I will do the following now:

  • Update trunk to a more recent version, 1.8/1.9/2.0, to see if the problems still persist.
  • If yes, then sort out KOMH patches to find which change fixes problem 1.
  • Try to fix problem 2 as well.
  • Check other KOMH patches and apply those we need.
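
For reference, the result= number in the RPC errors above appears to be the libcurl error code passed through by git's HTTP helper. Below is a small lookup for the two values seen in this ticket, using the names from libcurl's error code list. The helper function is made up for illustration and is not part of git or CURL.

```python
# Names and numbers as listed in libcurl's error code documentation.
CURL_RESULTS = {
    52: ("CURLE_GOT_NOTHING", "the server returned nothing (no headers, no data)"),
    56: ("CURLE_RECV_ERROR", "failure while receiving network data"),
}


def describe_rpc_failure(line):
    """Hypothetical helper: explain a git 'RPC failed; result=N' line."""
    marker = "result="
    start = line.index(marker) + len(marker)
    digits = ""
    for ch in line[start:]:
        if not ch.isdigit():
            break
        digits += ch
    code = int(digits)
    name, meaning = CURL_RESULTS.get(code, ("unknown", "not in this table"))
    return f"{code} {name}: {meaning}"


print(describe_rpc_failure("error: RPC failed; result=52, HTTP code = 0"))
print(describe_rpc_failure("error: RPC failed; result=56, HTTP code = 200"))
```

Code 52 matches the "Empty reply from server" text that CURL prints in verbose mode; code 56 is the one from the original clone attempt in comment:1.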

comment:17 Changed 10 years ago by dmik

I've imported top releases of each version (1.8.5.5, 1.9.4 and 2.0.0). But of course there is a problem with merging due to unusual file names test cases use (and the weakness of our SVN client in handling this). That's what I get:

svn: Unable to parse URL '/repos/ports/git/vendor/1.7.9.6/t/t4013/diff.format-patch_--inline_--stdout_initial..master^^'
svn: Error reading spooled REPORT request response

The error is completely meaningless. The name itself looks perfectly legal, and other SVN commands (checkout, update) handle it well. So I really wonder what it could be (I'm using SVN 1.6.16 from Paul AFAIR). But anyway, I'm not going to fix SVN right now as there's already too much in my pipeline.

I will create a temp branch (will name it dmik) and experiment on that branch to see which update we can use on OS/2 with less effort. A temporary branch is necessary to merge trunk and vendor/2.0.0 on Mac (where I have subversion 1.7 and all works).

Last edited 10 years ago by dmik

comment:18 Changed 10 years ago by Silvan Scherrer

That's exactly the same error I had with Ghostscript. So I decided to update to 1.7 and the error went away.
See http://os2ports.smedley.id.au/index.php?page=subversion for download links.

comment:19 Changed 10 years ago by dmik

I've just tried SVN 1.7 from Paul and it works well, no errors like that. Great. We will do an RPM for it too.

comment:20 Changed 10 years ago by dmik

I've updated git to 2.0.0 and unfortunately this didn't fix any of the RPC failures. The errors are completely the same. This means that the problem is somehow OS/2 specific. I will have to step debug it and study KOMH patches once again.

comment:21 Changed 10 years ago by dmik

Tried to grab git_os2_read()/git_os2_write() and _select2() changes from KOMH patches, but none seems to help with the HTTP problems. There is also a pipe/poll reimplementation which uses native OS/2 pipes. I will look closer if it may be related. The other code doesn't seem to be related to the HTTP transport at all so that must be it...

comment:22 Changed 10 years ago by dmik

I've applied the pipe/poll code from KOMH and it's still not working. Need to step debug the relevant code and perhaps expat.

comment:23 Changed 10 years ago by dmik

Using GIT_CURL_VERBOSE=1 is very handy to debug HTTP(S) sessions. From what I see, there are the following differences between the original 1.7.3.2 version from KOMH (works), 1.7.9.6 trunk (doesn't) and 2.0.0 dmik branch (doesn't):

  • Both 1.7.9.6 and 2.0.0 spit out messages like "0x20059d60 is at send pipe head!" and "getpeername() failed with errno 36: Operation now in progress" while connecting to github. 1.7.3.2 from KOMH doesn't do that.
  • 1.7.3.2 issues POST /psmedley/gcc/git-upload-pack HTTP/1.1 with Content-Length: 1115 and Expect: 100-continue and gets a successful reply with Content-Type: application/x-git-upload-pack-result.
  • 1.7.9.6 issues this POST with Content-Length: 1123 and fails with Empty reply from server.
  • 2.0.0 issues this POST with Content-Length: 1140 and fails with Empty reply from server.

comment:24 Changed 10 years ago by dmik

Content-Length doesn't seem to be directly related; git 1.8.5.2 on Mac has it set to 1151 and all works. However, on Mac git also says * upload completely sent off: 1151 out of 1151 bytes before getting a reply from github. Though this may be a version difference (more logging in 1.8) this gives a hint that for some reason content of the POST request is not sent to the server in our versions 1.7.9.6 and 2.0.0. I will check this.

Another thing is that AFAIR my own build of the version with KOMH patches behaved just like the current trunk (i.e. didn't work). But it never was 1.7.3.2, it was 1.7.9.6 with KOMH patches from 1.7.3.2 applied. So there are two possibilities: either KOMH versions of SSL and CRYPTO libraries (kssl and crypto) differ from what we have in terms of working with HTTP or there was some change after 1.7.3.2 that made KOMH patches ineffective. This also needs some thinking and checking.

comment:25 Changed 10 years ago by dmik

Another thing to mention is that the original KOMH build links to both SSL and CRYPTO but our builds only link to CRYPTO and also the KOMH build doesn't link to CURL which means he uses his own static CURL build (which could also contain some private fixes). Also the KOMH build doesn't use MMAP (but I don't think it's related to these problems).

comment:26 Changed 10 years ago by dmik

Yes, KOMH has his own static build of CURL (http://bauxite.sakura.ne.jp/software/os2/misc/curl-7.20.1-os2-20100522.zip) which also depends on LIBSSH2 (http://bauxite.sakura.ne.jp/software/os2/misc/libssh2-1.2.4-os2-20100301.zip). However, quickly linking git-remote-http.exe against it (which dropped CURL7.DLL dependency and dragged in SSL10.DLL) didn't cure the problem. So it seems to be irrelevant.

I guess it has something to do with strange messages like "0x20059d60 is at send pipe head!" and "getpeername() failed with errno 36: Operation now in progress" (which are results of git evolution after 1.7.3.2) then. I will concentrate on them and check how HTTP works.

comment:27 Changed 10 years ago by dmik

JFTR, linking against KOMH's SSL and CRYPTO doesn't help either. This, again, narrows the problem down to the new post-1.7.3.2 git code.

comment:28 Changed 10 years ago by dmik

Performed one more test: built git 1.7.3.2 with original KOMH patches, both using KOMH libs and using RPM libs (curl/ssh/crypto). Works, regardless of libs. So it's definitely the change in git after 1.7.3.2 that broke it. Okay, good to know that.

comment:29 Changed 10 years ago by dmik

Another observation: the messages "0x20059d60 is at send pipe head!" and "getpeername() failed with errno 36: Operation now in progress" are not to blame, as I see them in my 1.7.3.2 build too and it works. It's now clearly seen that the place where things break is the POST /psmedley/gcc/git-upload-pack HTTP/1.1 request. The old git gets an immediate reply to it; the new git seems to not get any reply and finally fails with the message "The remote end hung up unexpectedly". The only other visible difference, besides the different Content-Length, is that the old git also sets Expect: 100-continue. The new git doesn't. But that's unlikely to be the cause since new gits on other platforms don't set this header field at all.

comment:30 Changed 10 years ago by dmik

Just to make sure we don't miss something important outside, I updated our CURL port from 7.21.1 to the latest 7.37. I got these new "* upload completely sent off: 1140 out of 1140 bytes" messages in the debug output with set GIT_CURL_VERBOSE=1 but the problem is still the same. Okay, at least we have the newest CURL. I will commit it later, when git is done (the new CURL doesn't actually require a lot of work, just two small patches in lib/url.c, one old from Yuri and one new from me (trivial); moreover, given our new and shiny autoconf toolchain, it eliminates the need for Yuri's other patches and builds right out of the box).

comment:31 Changed 10 years ago by dmik

Another interesting thing with the new CURL is that the "0x20059d60 is at send pipe head!" and "getpeername() failed with errno 36: Operation now in progress" messages have disappeared. They come from the old CURL then and should be ignored.

comment:32 Changed 10 years ago by dmik

I compared the RPC packet sent on clone by git 1.7.3.2 (which works) and 2.0.0 (which doesn't). There are only minor differences (the later version of git has a couple of additional fields), the rest is identical. Moreover, I grabbed the working packet and made the trunk version send it instead of its own data, and it still doesn't work. This proves that the problem is not the packet contents but the connection itself. Somehow, newer git versions establish it differently on OS/2 and that confuses the git server.

Note that I also tried to build 2.0.0 without USE_CURL_MULTI (which enables simultaneous HTTP transfers) since this mode was absent in 1.7.3.2 and it downloaded everything in one stream. But this didn't help either. So the problem is not related to the multithreaded transfer mode.

comment:33 Changed 10 years ago by dmik

Just noticed that git uses fork() to start its commands on OS/2. And that it also uses so-called "stateless-rpc" mode where the child process doesn't open a new HTTP connection but instead pipes to and from the server through the parent (which holds an open connection) via stdin/stdout. Maybe this is where things break as it's known our fork() impl isn't perfect. One doubt is that git 1.7.3.2 also uses fork() and all works. But anyway it's good to get rid of fork in this particular case as it's used too intensively.
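
A toy sketch of the relay arrangement described above, in which the child never touches the network and the parent carries its request and reply over stdin/stdout. Everything here is illustrative: the one-line "protocol", the fake_server function and the helper names are made up, and Python's subprocess module is only a convenient stand-in for starting a child from an executable (it says nothing about how LIBC implements fork() or spawnvpe()).

```python
import subprocess
import sys

# Child program: talks only to its own stdout/stdin, never to a socket.
CHILD = r"""
import sys
sys.stdout.write("want d09a35b6\n")     # request goes to the parent
sys.stdout.flush()
reply = sys.stdin.readline()            # reply is delivered by the parent
sys.stderr.write("child got: " + reply)
"""


def fake_server(request):
    """Stand-in for the HTTP round trip that the parent would perform."""
    return "ACK " + request.split()[1] + "\n"


def relay_once():
    """Start the child, forward one request to the 'server' and the reply back."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    request = child.stdout.readline()
    child.stdin.write(fake_server(request))
    child.stdin.close()                 # flushes and signals end of input
    seen = child.stderr.read()
    child.wait()
    child.stdout.close()
    child.stderr.close()
    return request, seen


request, seen = relay_once()
print(request.strip(), "/", seen.strip())
```

If either pipe end misbehaves in such a setup, the failure shows up in the child as a read or write error rather than as a network error, which is one reason the pipe handling is worth checking separately from the HTTP code.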

comment:34 Changed 10 years ago by dmik

I replaced fork with spawnvpe and it still doesn't work. Wrong path... I will leave the spawnvpe changes in though since they are good to have anyway. Will commit them when I find what's wrong.

comment:35 Changed 10 years ago by dmik

I'm comparing git 2.0.1 output on Mac with git 2.0.0 on OS/2 with more logging (-v -v -v -v, GIT_TRANSPORT_HELPER_DEBUG=1, GIT_TRANSLOOP_DEBUG=1, GIT_CURL_VERBOSE=1, GIT_TRACE=1, GIT_TRACE_PACKET=1) and the Mac version issues the very same packet (both contents and length), and all the sequence of operations is completely the same and it works (in fact, the trace log output is almost identical). This leads me to a conclusion that the problem may lie in the CURL library itself but it is simply not triggered by earlier git versions because they don't use some features. I will add more debugging to CURL then.

Among the differences is the fact that on Mac a TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 is established while on OS/2 it is an SSL connection using TLSv1.0 / AES128-SHA. I wonder if this can be the source of the problems...
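
The logging setup listed in this comment can be captured in a small script so that both machines run with identical settings. This is a minimal sketch: the function name is made up, the URL is the repository used earlier in the ticket, and the code only builds the command line and environment without running anything.

```python
import os

# Variables exactly as listed in the comment above; all are set to "1".
DEBUG_VARS = (
    "GIT_TRANSPORT_HELPER_DEBUG",
    "GIT_TRANSLOOP_DEBUG",
    "GIT_CURL_VERBOSE",
    "GIT_TRACE",
    "GIT_TRACE_PACKET",
)


def debug_clone_invocation(url, dest, base_env=None):
    """Return (argv, env) for a verbose clone; nothing is executed here."""
    env = dict(os.environ if base_env is None else base_env)
    for name in DEBUG_VARS:
        env[name] = "1"
    argv = ["git", "clone", "-v", "-v", "-v", "-v", url, dest]
    return argv, env


argv, env = debug_clone_invocation("https://github.com/psmedley/gcc", "gcc")
print(" ".join(argv))
# To actually run it: subprocess.run(argv, env=env) from the subprocess module.
```

Redirecting stderr of both runs to files and diffing them is then enough to reproduce the Mac versus OS/2 comparison.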

comment:36 Changed 10 years ago by dmik

I simplified the situation down to the following. I installed a local Apache server to host the GCC git repository locally and see what's going on from the other side (this is not a clean test case though since both github and bitbucket use their own HTTP servers for git, but still). I didn't bother setting up HTTPS and used plain HTTP (and due to my simple setup POST is also not used, only GET, which makes things a bit easier to track down). The Mac client works, the OS/2 client fails but with a different diagnosis:

Getting pack e9d9ca0c0c74d2fea9e6114e33269ea6db4ec519
 which contains d09a35b6abdfcc5c88ac0ba1543ea5f37a4df113
* Couldn't find host 192.168.1.102 in the .netrc file; using defaults
* Found bundle for host 192.168.1.102: 0x2004a038
* Re-using existing connection! (#3) with host 192.168.1.102
* Connected to 192.168.1.102 (192.168.1.102) port 8888 (#3)
> GET /git/gcc.git/objects/pack/pack-e9d9ca0c0c74d2fea9e6114e33269ea6db4ec519.pack HTTP/1.1
User-Agent: git/2.0.0
Host: 192.168.1.102:8888
Accept: */*
Accept-Encoding: gzip

< HTTP/1.1 200 OK
< Date: Thu, 03 Jul 2014 19:45:10 GMT
* Server Apache/2.2.26 (Unix) mod_fastcgi/2.4.6 mod_wsgi/3.4 Python/2.7.6 PHP/5.5.10 mod_ssl/2.2.26 OpenSSL/0.9.8y DAV/2 mod_perl/2.0.8 Perl/v5.18.2 is not blacklisted
< Server: Apache/2.2.26 (Unix) mod_fastcgi/2.4.6 mod_wsgi/3.4 Python/2.7.6 PHP/5.5.10 mod_ssl/2.2.26 OpenSSL/0.9.8y DAV/2 mod_perl/2.0.8 Perl/v5.18.2
< Last-Modified: Thu, 03 Jul 2014 15:27:58 GMT
< ETag: "109a179-2ab7dd29-4fd4ba74c1b80"
< Accept-Ranges: bytes
< Content-Length: 716692777
< Content-Type: text/plain
< 
* Closing connection 3
error: Unable to get pack file http://192.168.1.102:8888/git/gcc.git/objects/pack/pack-e9d9ca0c0c74d2fea9e6114e33269ea6db4ec519.pack
The requested URL returned error: 404 Not Found
error: Unable to find d09a35b6abdfcc5c88ac0ba1543ea5f37a4df113 under http://192.168.1.102:8888/git/gcc.git
Cannot obtain needed object d09a35b6abdfcc5c88ac0ba1543ea5f37a4df113
error: Fetch failed.
Debug: Remote helper quit.

So it somehow can't download the pack file. It sees the file and the server replies with a response of the correct size, but git aborts. It's certainly a problem of git for OS/2 since, again, on Mac this pack file is found and downloaded successfully. I hope that breaking this down will help me with the POST code path as well.

comment:37 Changed 10 years ago by dmik

This is somehow related to TCP socket transfer timings. When I added more logging my local server started working: the given .pack file was successfully downloaded by git. I'm trying to find a git server that would serve HTTPS and support the git-upload-pack service that involves the POST method.

comment:38 Changed 10 years ago by dmik

I tried the GitBlit local server (http://gitblit.com/) but apparently it is somehow broken. Neither git on Mac nor git on OS/2 can clone via HTTPS from it. Looking for another one...

comment:39 Changed 10 years ago by dmik

Okay, I simply set up Apache on Mac using MAMP (http://www.mamp.info), then enabled the upload-pack backend using the official git manual (https://www.kernel.org/pub/software/scm/git/docs/git-http-backend.html) and got it to work in this mode with POST. Git on Mac works well in this setup and, guess what, git on OS/2 works well too! It gets a reply from the server after issuing the POST /psmedley/gcc/git-upload-pack HTTP/1.1 request and all goes fine after that.

This means it's either logging timing again or the problem is related to the external network interface setup (the server is run in my LAN so the traffic doesn't go through the router to the outer world).

comment:40 Changed 10 years ago by dmik

It actually doesn't work on the local server after rebooting the OS/2 machine until I add full logging again. Definitely, timings. And not related to the type of the interface. Fails with this:

POST git-upload-pack (gzip 2252 to 1140 bytes)
*** [1] remote-curl.c:636:post_rpc:
*** [265:1] D:/Coding/ports/curl/curl-7.37.0/lib/http.c:2507:Curl_http: postfields 0x201ac220 (1140)
*** [265:1] D:/Coding/ports/curl/curl-7.37.0/lib/http.c:2517:Curl_http: postsize 1140
fatal: write error: Bad address
fatal: early EOF
fatal: index-pack failed

I.e. it can't even write the POST packet out.

But even with logging enabled, after successfully receiving the pack file git starts the index-pack command and that also fails on OS/2 with the following output:

trace: run_command: 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 225 on hugaida' '--check-self-contained-and-connected' '--pack_header=2,1317807'
trace: exec: 'git' 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 225 on hugaida' '--check-self-contained-and-connected' '--pack_header=2,1317807'
trace: built-in: git 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 225 on hugaida' '--check-self-contained-and-connected' '--pack_header=2,1317807'
fatal: write error: Bad address
fatal: early EOF
fatal: index-pack failed
Debug: Remote helper quit.

comment:41 Changed 10 years ago by dmik

While testing things I see a lot of code in CURL that replaces poll() with select() (the same as with ...). Maybe the problem is in this replacement; I will check that. There is also some strange pipe() usage, which also needs to be checked.

comment:42 Changed 10 years ago by dmik

Making my way through CURL; it's rather complex, I must say. Pipe is not involved in our problem, and the poll implementation seems to be correct. I now suspect OpenSSL as it's that library which does the actual data sending in case of HTTP(S) in CURL (ossl_send). It must be something related to the size of the buffer to send as it's known to work on small repos and small pushes.

comment:43 Changed 10 years ago by dmik

More evidence of a problem with the packet size: I changed MAX_INITIAL_POST_SIZE to a very low value of 1024 (originally it is 64k) and now the POST request worked fine. This value limits the size of the POST request when attaching it to the headers block for sending in a single packet. If the POST buffer exceeds this value, it is sent in separate chunks. So, somehow attaching it to the headers screws up the OS/2 TCP/IP stack (or OpenSSL, this is to be sorted out).
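
The effect of lowering the limit can be pictured with a toy model. This is not CURL's actual code, only the decision described in this comment; the function name and the exact boundary comparison are assumptions.

```python
MAX_INITIAL_POST_SIZE = 64 * 1024   # original value, per the comment above


def plan_post(body_len, limit=MAX_INITIAL_POST_SIZE):
    """Toy model: is the POST body attached to the header block or sent apart?

    The strict '<' at the boundary is an assumption made for this sketch.
    """
    if body_len < limit:
        return "body sent together with the headers"
    return "headers first, body sent separately"


# The 1140-byte git-upload-pack request from this ticket:
print(plan_post(1140))               # with the stock limit
print(plan_post(1140, limit=1024))   # with the limit lowered to 1024
```

With the stock limit the 1140-byte request rides along with the headers; with the limit at 1024 it is sent separately, which is the case that worked.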

This, however, doesn't mean that it fully worked. It failed later, during write, with "Bad address", just the same way it fails when I clone the same repo from the local server. I have to figure out what that is.

comment:44 Changed 10 years ago by dmik

The Bad address error comes from the plain LIBC write attempting to write to a file descriptor which is most likely the child end of the pipe created by the parent process. I'm trying to track down what this pipe is used for.

From what I see in LIBC code, EFAULT is usually returned when the input parameters are wrong. I can't see any code returning EFAULT that could be related to write except the __libsocket_safe_copy intended to make sockets work with high memory but it doesn't seem to be our case since recompiling git in low memory mode gives the same thing.

Last edited 10 years ago by dmik

comment:45 Changed 10 years ago by Steven Levine

Cc: Steven Levine added

comment:46 Changed 10 years ago by dryeo

Why are you referencing komh for the bauxite port? komh has his own page with a Korean tld and the ports at the bauxite site are by some Japanese users.

comment:47 Changed 10 years ago by dryeo

Cc: dryeo added

comment:48 Changed 10 years ago by dmik

Steven, I have no idea why I refer to that build as KOMH's; really. It has been like that for a long time, but now I see that it's a build by Bauxite (with Sava patches). Some mixup happened back then.

JFTR, when trying to push with my POST hack applied to CURL I get this (i.e. it doesn't work):

* SSL read: error:00000000:lib(0):func(0):reason(0), errno 60
* Closing connection 2
* The cache now contains 0 members
error: RPC failed; result=56, HTTP code = 0
fatal: The remote end hung up unexpectedly
fatal: The remote end hung up unexpectedly
Debug: Remote helper: <-
Everything up-to-date
Debug: Disconnecting.

comment:49 Changed 10 years ago by dmik

Okay, "Bad address" in write() comes from the crash that happens in git.exe called with the index-pack built-in command. It crashes in the process in a rather strange way, with the XCPT_PROCESS_TERMINATE exception. Maybe related to fork.

Last edited 10 years ago by dmik

comment:50 Changed 10 years ago by dmik

I added EXCEPTQ support to git to get the call stack trace upon the crash, but it seems that adding EXCEPTQ makes the code crash before the original crash I'm looking for. At least I get this in the .TRP file:

______________________________________________________________________

 Exception Report - created 2014/07/15 05:53:07
______________________________________________________________________

 OS2/eCS Version:  2.45
 # of Processors:  4
 Physical Memory:  3327 mb
 Virt Addr Limit:  2560 mb
 Exceptq Version:  7.10 (Mar  1 2011)

______________________________________________________________________

 Exception C0000005 - Access Violation
______________________________________________________________________

 Process:  D:\CODING\GIT\DMIK-RUN\LIBEXEC\GIT-CORE\GIT.EXE
 PID:      4164 (16740)
 TID:      02 (2)
 Priority: 200

 Filename: C:\USR\LIB\GCC473.DLL
 Address:  005B:1DDC00E3 (0001:000000E3)
 Cause:    Attempted to write to 0288FF38
           (uncommitted memory allocated by EXCEPTQ)

______________________________________________________________________

 Failing Instruction
______________________________________________________________________

 1DDC00D2  CMP EAX, 0x1000       (3d 00100000)
 1DDC00D7  LEA ECX, [ESP+0xc]    (8d4c24 0c)
 1DDC00DB  JB  0x1ddc00f2        (72 15)
 1DDC00DD  SUB ECX, 0x1000       (81e9 00100000)
 1DDC00E3 >OR  DWORD [ECX], 0x0  (8309 00)
 1DDC00E6  SUB EAX, 0x1000       (2d 00100000)
 1DDC00EB  CMP EAX, 0x1000       (3d 00100000)
 1DDC00F0  JA  0x1ddc00dd        (77 eb)

______________________________________________________________________

And this is on the screen:

Creating 4164_02.TRP

Killed by SIGSEGV
pid=0x4164 ppid=0x4163 tid=0x0002 slot=0x00a4 pri=0x0200 mc=0x0001 ps=0x0010
D:\CODING\GIT\DMIK-RUN\LIBEXEC\GIT-CORE\GIT.EXE
GCC473 0:000000e3
cs:eip=005b:1ddc00e3      ss:esp=0053:0289ff2c      ebp=0289fff4
 ds=0053      es=0053      fs=150b      gs=0000     efl=00010202
eax=0000106c ebx=00000007 ecx=0288ff38 edx=002df90c edi=002df90c esi=000a6544
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
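
As a side calculation on the register dump above: the loop in the Failing Instruction listing touches the stack one 4 KiB page at a time, which is what a stack probe for a large on-stack allocation typically looks like. The sketch below replays that arithmetic from the posted values. It assumes that EAX holds the number of bytes still to be probed and that ESP did not change inside the loop; the function name is made up, and Python is used here only as a calculator.

```python
PAGE = 0x1000


def replay_probe(esp, fault_ecx, fault_eax):
    """Work back from the fault state to the probe count and initial EAX.

    Loop body, per the disassembly:
        SUB ECX, 0x1000 ; OR DWORD [ECX], 0 ; SUB EAX, 0x1000
    """
    start = esp + 0xC                   # LEA ECX, [ESP+0xc]
    delta = start - fault_ecx
    if delta <= 0 or delta % PAGE:
        raise ValueError("fault address is not a whole number of pages below start")
    probes = delta // PAGE              # the last of these is the one that faulted
    # EAX is decremented only after a successful touch, so the faulting
    # probe has not been subtracted from EAX yet.
    initial_eax = fault_eax + (probes - 1) * PAGE
    return probes, initial_eax


probes, requested = replay_probe(esp=0x0289FF2C, fault_ecx=0x0288FF38, fault_eax=0x106C)
print(probes, hex(requested))           # prints: 16 0x1006c
```

With the values from the dump this gives 16 probes and a starting size of 0x1006c bytes, i.e. slightly over 64 KiB. That is only arithmetic on the dump, not a diagnosis of which code requested that much stack.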

comment:51 Changed 10 years ago by dmik

Seems that the above problem is caused by EXCEPTQ itself. I disabled it and could run the clone a bit longer this time (around 6 MB of repo data out of several hundred was downloaded) which ended up with this:

fatal: protocol error: bad line length character: шнГА
error: inflate: data stream error (incorrect data check)
fatal: pack has bad object at offset 8568233: inflate returned -3
fatal: index-pack failed
Debug: Remote helper quit.

шнГA is this sequence of bytes: {E8 AD 83 41}, which is definitely not a line length. Perhaps the XCPT_PROCESS_TERMINATE crash I sometimes see is caused by a different piece of garbage. This garbage is clear evidence that the received packet gets corrupted somewhere. The pipeline it goes through looks as follows:

TCP/IP stack -> OpenSSL's recv -> CURL read -> Git

I did many checks of Git and CURL and still didn't find the failure which indirectly suggests that it's either OpenSSL or TCP/IP.

I will try to see what's on the wire with iptrace and what gets delivered to Git. I don't know if it will give me any hint though.
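
For context, git's wire protocol frames data as pkt-lines: each one starts with four ASCII hex digits giving the total length, and the flush packet is the literal 0000 (which is also what the "https unexpectedly said: '0000'" warning in comment:1 shows). The sketch below, with a made-up helper name, shows why the four bytes above cannot be a length prefix, and that decoding them as codepage 866 (an assumption about the console codepage) yields the characters printed in the error.

```python
import string


def pkt_line_length(prefix):
    """Parse the 4-byte length prefix of a git pkt-line.

    Returns 0 for the flush packet b"0000"; raises ValueError otherwise
    when the prefix is not four ASCII hex digits.
    """
    if len(prefix) != 4:
        raise ValueError("need exactly 4 bytes")
    text = prefix.decode("ascii")       # UnicodeDecodeError is a ValueError
    if any(ch not in string.hexdigits for ch in text):
        raise ValueError("bad line length character: %r" % text)
    return int(text, 16)


garbage = bytes([0xE8, 0xAD, 0x83, 0x41])

print(pkt_line_length(b"0000"))         # flush packet
print(pkt_line_length(b"001e"))         # a 30-byte pkt-line

try:
    pkt_line_length(garbage)
except ValueError as exc:
    print("rejected:", type(exc).__name__)

# Codepage 866 rendering of the same bytes (sh, n, G in Cyrillic, then Latin A).
print(garbage.decode("cp866") == "\u0448\u043d\u0413A")
```

Three of the four bytes are above 0x7F, so they can never appear in a valid length prefix; the reader is looking at payload or corrupted data where a frame boundary was expected.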

Last edited 10 years ago by dmik

comment:52 Changed 10 years ago by Steven Levine

Exceptq trap is most likely because you are using an antique exceptq on an SMP box. FWIW, the trap is in _alloca().

Last edited 10 years ago by Steven Levine

comment:53 Changed 10 years ago by dmik

Well, IPTRACE is actually of no use as all interesting packets are encrypted by OpenSSL. Doesn't give me anything. I see a big packet that looks like the POST request sent by git (judging by the destination and size) but nothing comes from the github server in reply. Not a single packet.

Steven, yes, it's written in the log file that it's version 7.10. I really wonder why the new install has disappeared. That's perhaps because of the new RPM package: I deleted all custom EXCEPTQ DLLs (except the default eCS one) and only installed exceptq-devel afterwards, which doesn't drag exceptq in.

Anyway, with the right DLL (7.11, 3 Mar 2014) it still crashes at the same place as version 7.10 but the report is a bit different:

______________________________________________________________________

 Exception Report - created 2014/07/17 03:54:29
______________________________________________________________________

 OS2/eCS Version:  2.45
 # of Processors:  4
 Physical Memory:  3327 mb
 Virt Addr Limit:  2560 mb
 Exceptq Version:  7.11-shl (Mar  3 2014)

______________________________________________________________________

 Exception C0000005 - Access Violation
______________________________________________________________________

 Process:  D:\CODING\GIT\DMIK-RUN\LIBEXEC\GIT-CORE\GIT.EXE
 PID:      4E (78)
 TID:      02 (2)
 Priority: 200

 Filename: C:\USR\LIB\GCC473.DLL 02/05/2014 03:17:06 26,987
 Address:  005B:1DDC00E3 (0001:000000E3)
 Cause:    Attempted to write to 0288FF38
           (uncommitted memory allocated by EXCEPTQ)

______________________________________________________________________

 Failing Instruction
______________________________________________________________________

 1DDC00D2  CMP EAX, 0x1000       (3d 00100000)
 1DDC00D7  LEA ECX, [ESP+0xc]    (8d4c24 0c)
 1DDC00DB  JB  0x1ddc00f2        (72 15)
 1DDC00DD  SUB ECX, 0x1000       (81e9 00100000)
 1DDC00E3 >OR  DWORD [ECX], 0x0  (8309 00)
 1DDC00E6  SUB EAX, 0x1000       (2d 00100000)
 1DDC00EB  CMP EAX, 0x1000       (3d 00100000)
 1DDC00F0  JA  0x1ddc00dd        (77 eb)

comment:54 Changed 10 years ago by Steven Levine

OK. Did you post the entire .trp file or just the header? I'm going to need to see a process dump so that I can figure out what is calling alloca. Exceptq does not use alloca explicitly. If you don't have it, grab the latest version of pdumpctl from my os2diags page, set it up for a "normal" dump and let me know where to grab the dump from.

One thing worth trying is unloading exceptq and moving distorm.dll out of LIBPATH to see what effect, if any, this has on the trap.

I did not expect the iptrace to be readable. What I would be looking for is the packet deltas and whether there were any retries. This is mostly to confirm what the git/curl debug output reported.

comment:55 Changed 10 years ago by Steven Levine

Cc: Steven Levine removed

comment:56 Changed 10 years ago by Steven Levine

Cc: steve53@… added

comment:57 Changed 10 years ago by dmik

Steve, yes, that was only the header. The full version of the .TRP and the IPTRACE dump is here: http://rpm.netlabs.org/test/git-test-steve-1.zip.

I will try to remove distorm.dll when I try this next time.

comment:58 Changed 10 years ago by dmik

Steve, here is the .TRP file with distorm.dll disabled: http://rpm.netlabs.org/test/git-test-steve-2.zip. It differs only in that it complains that no DISTORM or DIS386 is found.

comment:59 Changed 10 years ago by Steven Levine

The full .trp file makes it obvious that we have a stack overflow.

Stack Info for Thread 02
00010000 028A0000 -> 0289FF2C -> 02890000 -> 02890000

Alloca() is touching the stack to ensure that the pages are committed so that the stack content can be accessed in any order.

ECX : 0288FF38 is outside the stack and EAX : 0000106C tells us how much stack was left to be touched when the exception triggered.

64K is a smallish stack for apps ported from *ix. However, it is the default used by pthreads. I recommend you tweak the pthread_create calls to allocate something like a 512KB stack. git does not appear to create a lot of threads, so this should not cause memory availability issues.
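The arithmetic can be checked by replaying the probe loop from the failing-instruction listing. This is only a sketch: it assumes the Stack Info line reads as size, top, ESP and lowest stack address, and the requested size of 0x1006C bytes is not in the report but inferred by working backwards from the final ECX and EAX values.

```python
PAGE = 0x1000


def replay_alloca_probe(esp, request, stack_bottom):
    """Replay the page-touching loop; return (ECX, EAX) at the first touch
    below stack_bottom, or None if every probe stays inside the stack."""
    eax = request
    ecx = esp + 0xC              # LEA ECX, [ESP+0xc]
    if eax < PAGE:               # JB: small requests skip the loop
        return None
    while True:
        ecx -= PAGE              # SUB ECX, 0x1000
        if ecx < stack_bottom:   # OR DWORD [ECX], 0x0 would fault here
            return ecx, eax
        eax -= PAGE              # SUB EAX, 0x1000
        if eax <= PAGE:          # JA loops only while EAX > 0x1000
            return None


STACK_TOP = 0x028A0000
STACK_BOTTOM = 0x02890000        # 64K below the top
ESP = 0x0289FF2C
REQUEST = 0x1006C                # assumption, inferred from the report

if __name__ == "__main__":
    print(replay_alloca_probe(ESP, REQUEST, STACK_BOTTOM))
    print(replay_alloca_probe(ESP, REQUEST, STACK_TOP - 512 * 1024))
```

With the values from the report, the replay stops at ECX = 0288FF38 with EAX = 0000106C, the same registers quoted above. With a 512KB stack the same request fits.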

comment:60 Changed 10 years ago by dmik

I have updated the OpenSSL port to version 1.0.0n (the previous 1.0.0a was 4 years old) and according to my tests the problems with the POST packet have gone: the server sends the answer and git goes further. However, the second child git process still crashes, and the reason is not yet known.

BTW, OpenSSL 1.0.1i (which is now our trunk) is also merged in but it has some problems with linking the DLL (they use some tools like ar that are currently unreliable on OS/2). I will make it build later. I created #30 for it.

Steven, good guess about the stack condition (I overlooked that in .TRP). This must be the case as I temporarily disabled the code that increased the default stack size during some of my tests. I've now re-enabled it and we will see if the EXCEPTQ problem goes away.

Changed 10 years ago by Steven Levine

Annotated ipformat output of iptrace.dmp for git-test-steve-1.zip

comment:61 Changed 10 years ago by Steven Levine

Following up on the iptrace.dmp from git-test-steve-1.zip...

It appears that a fragment reassembly error occurred. This is what packet# 126 ICMP is reporting.

It's possible that packet# 376 is related, but there's not enough trace to say for sure and IP 192.030.252.129 does not seem to be related to the transmissions that did complete without error.

If the problem persists, one could experiment with the fragttl setting and see if this has any effect.

It also might be useful to use wireshark to decode ICMP data. You can use ipformat -x to convert iptrace.dmp to something that wireshark understands.

comment:62 in reply to:  60 Changed 10 years ago by Steven Levine

Replying to dmik:

Steven, good guess about the stack condition (I overlooked that in .TRP). This must be the case as I temporarily disabled the code that increased the default stack size during some of my tests. I've now re-enabled it and we will see if the EXCEPTQ problem goes away.

In hindsight, I don't really think we had an exceptq problem per se.
More likely, loading exceptq used enough additional stack to trigger the overflow.

comment:63 Changed 10 years ago by dmik

The EXCEPTQ problem has gone indeed. So it was the lack of stack space for it to handle some exception. The strange thing is that I don't get any crashes at all now. The (3rd) child process just terminates w/o crashing; I need to find out why. BTW, the process chain when cloning from HTTP is as follows (JFTR):

  1. git.exe clone <URL>
  2. git-remote-https.exe origin <URL>
  3. git.exe fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress <URL>
  4. git.exe index-pack --stdin --fix-thin --keep=fetch-pack <PID_3> on <HOST> --check-self-contained-and-connected --pack-header=X,Y

Process 2 is the one that creates the HTTPS connection through Curl which opens a socket via OpenSSL. This socket is then used to read data from the remote server. This data is then flushed to Process 3 via a pipe connected to its stdin. Process 3 in turn sends some data further to Process 4 via a pipe connected to its stdin as well.

This run happens to download about 10MB and then it terminates with the following error coming from Process 2:

D:/Coding/ports/curl/curl-7.37.0/lib/select.c:488:Curl_poll: errno 14 (Bad address)

This usually happens when the child process closes its end of the pipe (e.g. by terminating). It's not yet clear to me why this happens. It's quite a complex execution chain, so I need to add more debugging.
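The failure mode described here, a writer losing its reader, is easy to reproduce in isolation. The Python sketch below is only an analogy: on a POSIX system the write fails with EPIPE, whereas the OS/2 build of Curl reports errno 14 from Curl_poll. The helper name is made up for the illustration.

```python
import errno
import os


def write_after_reader_closed(payload: bytes) -> int:
    """Write to a pipe whose read end is already closed; return the errno."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)                # the "child" goes away
    try:
        os.write(write_fd, payload)  # the "parent" keeps sending
    except OSError as err:
        return err.errno
    finally:
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    code = write_after_reader_closed(b"pack data")
    print(code, errno.errorcode.get(code))
```

In the real chain the same thing would happen one level up: if Process 3 exits, Process 2 has nowhere to flush the data it reads from the socket.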

I also rechecked the push test with the new OpenSSL and with iptrace enabled. And, surprisingly, after a number of failed attempts it worked! I'm not yet sure whether it's OpenSSL that cured it, because I found out that my ISP is having heavy connectivity problems at home and my internet drops off very often. The other failed attempts could be down to the connection loss as well. But when I switched to mobile, it worked right away. I now think that this connection problem may somehow be a reason for the clone failure. I can't tell for sure because I can't clone that much through my mobile's 3G network. I hope that the ISP will fix the problem tomorrow and I will retry.

comment:64 Changed 10 years ago by dmik

Steven, thanks for annotating the IPTRACE dump and for your hints. I will try fragttl once I restore my broadband connection. Some comments regarding what git did there. It opened a connection and exchanged some system information with the server (all succeeded). Then it tried the POST request but didn't receive any answer. It decided there was a network problem and reconnected (to a new IP, since github responds on several IPs), just to send this POST request again. The packet whose size is 1490 bytes is this packet. What puzzles me is why there are many packets of this size in the trace. I'm sure that the software generates it only twice (the first attempt and the second one). Does that mean that Curl/OpenSSL is just trying to send it over and over again for some reason? I see that after some of these packets the server sends a reply, but I can't (yet) understand why they are then resent.

And what do you mean by saying that packet #376 is late?

comment:65 Changed 10 years ago by dmik

Okay, after resolving my ISP's connectivity issues I tested clone again, and this indeed turned out to be the lack of stack! It was in the air. There is already r354, where Yuri attempted to solve a similar problem by increasing the stack size from 64K (the default for his pthread_create implementation) to 640K, but that's apparently not enough for some cases. After increasing the stack to 640M (just to be sure that's enough :) I could *almost* successfully clone Paul's giant GCC repo from the link at the top of the ticket (740 MB in size).

It failed at the very end, however, in mmap():

Debug: Remote helper: <- lock D:/Coding/1/.git/objects/pack/pack-591156e16349f58059bd987225c3c7237e78b351.keep
Debug: Remote helper: Waiting...
Debug: Remote helper: <- connectivity-ok
Debug: Remote helper: Waiting...
Debug: Remote helper: <-
trace: run_command: 'rev-list' '--objects' '--stdin' '--not' '--all'
trace: exec: 'git' 'rev-list' '--objects' '--stdin' '--not' '--all'
*** [61507:1] git.c:551:main: xcpt_rec 0x2dff48
*** [61507:1] compat/os2.c:879:git_os2_main_prepare: xcpt_rec 0x2dff48
trace: built-in: git 'rev-list' '--objects' '--stdin' '--not' '--all'
fatal: Out of memory? mmap failed: Invalid argument
Unexpected end of command stream

The big difference is that the index (the contents of .git) was left intact by git after this failure, which (most likely) means that all repo data had been successfully downloaded from the server and that this failure was related not to the network operations but to the procedure of checking out a working copy from the index (where mmap failed). I will look at mmap tomorrow. This is very good progress, anyway.

Note that the man page says that "On Linux/x86-32, the default stack size for a new thread is 2 megabytes". I guess we should do the same in the OS/2 implementation of pthread_create to be more compatible with the *nix world. Since the original git code doesn't increase the default stack size, this should be enough for normal git usage. I will talk to Yuri about that.
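Choosing the stack size explicitly at thread creation, rather than relying on whatever the platform default is, looks roughly like this. The sketch uses Python's threading.stack_size as a stand-in for the stack-size attribute passed to pthread_create; it is not the OS/2 pthread code, and run_with_stack is a hypothetical helper.

```python
import threading


def run_with_stack(func, stack_bytes):
    """Run func in a new thread created with the given stack size."""
    result = []
    # The setting applies to threads created after this call.
    previous = threading.stack_size(stack_bytes)
    try:
        worker = threading.Thread(target=lambda: result.append(func()))
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)   # restore the earlier setting
    return result[0]


if __name__ == "__main__":
    two_megabytes = 2 * 1024 * 1024      # the Linux/x86-32 default quoted above
    print(run_with_stack(lambda: "done", two_megabytes))
```

The point of a larger default is that code such as git, which never asks for a bigger stack itself, then runs without per-application patches.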

comment:66 Changed 10 years ago by dmik

Ok, mmap() fails because git sets its offset argument to non-zero which is not supported by our MMAP implementation. This is done when walking through the index pack file (use_pack() in sha1_file.c).

Note that the git builds from Bauxite didn't have this problem because they were built with NO_MMAP — in this case git uses a surrogate that simply allocates a buffer in memory and reads the file contents into that buffer. And this surrogate supports the offset argument well.

I will check how difficult it is to support offset in our MMAP and then we will decide. This argument is just an offset within a file to map that defines where to start mapping the file contents from. In theory this should be easy to implement. It would be nice to have this feature in since I guess there are other apps using this that we want or may want to port.
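To make the two approaches concrete, here is a small Python sketch that maps a file window at a non-zero offset and compares it with a read-into-a-buffer surrogate in the spirit of the NO_MMAP path. It is an illustration only, not git's or the OS/2 MMAP library's code; the function names and the test file contents are invented.

```python
import mmap
import os
import tempfile


def map_at_offset(path, offset, length):
    """Map `length` bytes of the file starting at `offset` (real mmap)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), length,
                       access=mmap.ACCESS_READ, offset=offset) as m:
            return m[:]


def read_at_offset(path, offset, length):
    """Surrogate: allocate a buffer and read the file contents into it."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo():
    gran = mmap.ALLOCATIONGRANULARITY    # offset must be a multiple of this
    data = os.urandom(gran) + b"pack window"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        return (map_at_offset(tmp.name, gran, 11),
                read_at_offset(tmp.name, gran, 11))
    finally:
        os.unlink(tmp.name)


if __name__ == "__main__":
    print(demo())
```

Both calls return the same bytes; the difference is that the surrogate copies them into process memory, while a real mapping with offset support lets the caller slide a window over a large pack file.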

comment:67 Changed 10 years ago by dmik

I've implemented the offset thingy in MMAP. The test case from here http://man7.org/linux/man-pages/man2/mmap.2.html works fine and I'm now trying to clone the Mozilla/GCC repos to see if it solves the problem.

This improvement required changing both WPSTK and MMAP. We have both on SVN here, but Yuri actually uses his private build for MMAP; the one that is on SVN is somewhat different. I will have to commit Yuri's changes to WPSTK (plus my own) to SVN to make it consistent and easy to maintain. However, this will take some time as the WPSTK repo is itself a bit messy (an import from SVN etc).

comment:68 Changed 10 years ago by dmik

The Mozilla repo has been successfully checked out! Now trying GCC (that will take longer).

comment:69 Changed 10 years ago by dmik

I congratulate everyone — the GCC repository has been successfully cloned too.

Now I have to commit all patches to all libraries that I made while making this beast work and then I will close this defect.

comment:70 Changed 10 years ago by Silvan Scherrer

just to give a short progress on that:
we are now creating all needed rpm's to have git installed by rpm.
those are:
m4 (done)
automake (done)
autoconf (done)
libtool (done)
curl (done)
expat (done)
openssl (done)
git

comment:71 Changed 10 years ago by dmik

Yes, there have been quite a lot of package updates and quite a lot of work: besides fixing various OS/2-specific glitches, all packages were updated to very recent versions. Many packages got their first RPM versions on OS/2 (automake, libtool, expat). There was also a fix for our mmap emulation that is now used in git (also in RPM).

comment:72 Changed 10 years ago by dmik

I finally rebuilt and retested everything with the latest changes and fixes (still 2.0.0 due to #37). I also removed a bit of sava's code that isn't really needed (for the sake of clarity and ease of maintenance, see r861 and r862).

I did some local tests on the resulting build (clone, commit, push, pull) and all seems to work for me. I'm going to release an RPM now.

comment:73 Changed 10 years ago by dmik

Unfortunately, the long test of GCC showed some problems, see #38 for details. I committed a workaround in r863 and now the test passes. Will release an RPM with this workaround for the time being.

comment:74 Changed 10 years ago by dmik

Summary: git: Apply patches by komh → git: Apply patches by bauxite

The RPM is out. You may now install it with yum git. The current saga is over, I'm closing this ticket (finally).

Note that although there are several other RPM packages besides git itself (e.g. git-svn) they can't be installed at the moment due to missing perl dependencies. In particular, git-svn needs the SVN::Core perl module which is part of the subversion distribution. Our svn RPM is a package for the binary build of subversion from Paul Smedley so it lacks the perl stuff (as well as Paul's zips). Anyway, this is a separate task unrelated to git itself.

comment:75 Changed 10 years ago by dmik

Resolution: fixed
Status: new → closed

BTW, I've also renamed the 'komh' branch to 'bauxite' — to match the reality.
