
I don't have access to the AWS CLI, and I have already tried the S3 console. I clicked Buckets, which had a "Find buckets by name" box, so I typed arxiv, arxiv-src, arxiv_src, arXiv, arXiv-src, arXiv_src, etc. However, this time, in the upper left corner, I found the Amazon S3 side panel, which had a tab called "Buckets". One of the links led to a page that did nothing but lead back to where I had started. I typed arxiv and arxiv-src into both the search box in the upper left corner and the cloud shell, and some commands from the arXiv website didn't work.

After clicking around for a while, I found the cloud shell. This led to a page titled "Step 3: Download an object", which directed me to the Amazon S3 console, but that just directed me back again. However, under Common tasks, I found the "Download an object" tab. The results from googling did not explain how to use it either. The "Requester Pays buckets" note finally made sense. I thought it was supposed to be easy, like the cloud, but it was not.

Issue with method 1: I then tried to download the files from Amazon S3. That basically turned the bulk download of the entire arXiv PDF collection from an option into a necessity.

Issue with method 3: I tried to download the files directly using the website addresses, as suggested on Kaggle. That got me some 500 error messages, and then Python raised socket.error: Connection reset by peer, even though I tried to limit the burst to a maximum of 4 requests per second, as indicated on the website. Eventually, I read that 1 article could be downloaded every 15 seconds continuously. I was able to run conda install -c conda-forge gsutil; however, the approach suggested in "How to run Google gsutil using Python" didn't work.
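For the direct-download attempt (method 3), here is a minimal sketch of how the "1 article every 15 seconds" pacing could be enforced in Python, with a few retries for the 500 and connection-reset errors. The names Throttle and fetch, the retry count, and the 60-second timeout are my own assumptions for illustration, not anything from the arXiv or Kaggle pages.

```python
import time
import urllib.request
from urllib.error import URLError


class Throttle:
    """Enforce a minimum number of seconds between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable, so the pacing can be checked without real waiting
        self._sleep = sleep
        self._last = None     # time of the previous request; None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle, retries=3, opener=urllib.request.urlopen):
    """Download one URL, pausing via `throttle` before every attempt (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        throttle.wait()
        try:
            with opener(url, timeout=60) as response:
                return response.read()
        except (URLError, ConnectionError):
            # HTTP 5xx responses arrive as HTTPError (a URLError subclass);
            # "Connection reset by peer" may arrive as ConnectionResetError
            # or wrapped in a URLError, depending on where it happens.
            if attempt == retries:
                raise


# Usage (the URL is a placeholder, not a real arXiv address):
# throttle = Throttle(15)          # 15 seconds between requests, per what I read
# data = fetch("https://example.org/some-paper.pdf", throttle)
```

Sharing one Throttle across all downloads means retries are paced as well, so a failed request does not trigger an immediate burst of new ones.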

I tried from google.cloud import storage, but conda install -c conda-forge google-cloud-sdk and conda install -c conda-forge google-cloud did not work, pip install google-cloud did nothing, and the library could not be used. I tried searching for arxiv-dataset on the Google Cloud website, and copied gs://arxiv-dataset/arxiv/ and gsutil cp gs://arxiv-dataset/arxiv/ into the cloud shell, which didn't work. The Bulk access page listed the code to access Google Cloud.
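For reference, this is a rough sketch of what the Python route might look like. As far as I can tell, from google.cloud import storage is provided by the google-cloud-storage package (pip install google-cloud-storage) rather than google-cloud. The helper names split_gs_uri and list_objects are hypothetical, and the sketch assumes the bucket allows anonymous reads, which I have not confirmed.

```python
def split_gs_uri(uri):
    """Split 'gs://bucket/some/prefix' into (bucket, prefix)."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    return bucket, prefix


def list_objects(uri, limit=10):
    """Return up to `limit` object names under a gs:// prefix (hypothetical helper)."""
    # Imported here so split_gs_uri works even when the package is missing.
    from google.cloud import storage  # assumption: installed via google-cloud-storage

    bucket, prefix = split_gs_uri(uri)
    # Assumption: the bucket is publicly readable, so no credentials are needed.
    client = storage.Client.create_anonymous_client()
    blobs = client.list_blobs(bucket, prefix=prefix, max_results=limit)
    return [blob.name for blob in blobs]


# Usage:
# for name in list_objects("gs://arxiv-dataset/arxiv/"):
#     print(name)
```

Listing a few object names first seems like a cheap way to check that access works before attempting a full copy. Note also that gsutil cp expects both a source and a destination, so the command I pasted with only the gs:// path could not have worked as written.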

Issues with method 2: Kaggle itself does not actually host the PDF files, only the metadata, which was useful. AWS for PDF and/or (La)TeX source files. I've been trying to figure out how to download the bulk PDFs from arXiv; it has been over 12 hours, and it has been very confusing. I'm new to this area, so there might be some seemingly trivial mistakes.

This post is a bit long, but I wanted to show the attempts I had made.
