Monthly Archives: January 2018

Use s3cmd to Download from Requester Pays Buckets on S3

List the top-level pdf prefix (without a trailing slash, only the directory entry is shown):

$ s3cmd ls --requester-pays s3://arxiv/pdf
                       DIR   s3://arxiv/pdf/

List files under pdf:

$ s3cmd ls --requester-pays s3://arxiv/pdf/\*
2010-07-29 19:56 526202880   s3://arxiv/pdf/arXiv_pdf_0001_001.tar
2010-07-29 20:08 138854400   s3://arxiv/pdf/arXiv_pdf_0001_002.tar
2010-07-29 20:14 525742080   s3://arxiv/pdf/arXiv_pdf_0002_001.tar
2010-07-29 20:33 156743680   s3://arxiv/pdf/arXiv_pdf_0002_002.tar
2010-07-29 20:38 525731840   s3://arxiv/pdf/arXiv_pdf_0003_001.tar
2010-07-29 20:52 187607040   s3://arxiv/pdf/arXiv_pdf_0003_002.tar
2010-07-29 20:58 525731840   s3://arxiv/pdf/arXiv_pdf_0004_001.tar
2010-07-29 21:11  44851200   s3://arxiv/pdf/arXiv_pdf_0004_002.tar
2010-07-29 21:14 526305280   s3://arxiv/pdf/arXiv_pdf_0005_001.tar
2010-07-29 21:27 234711040   s3://arxiv/pdf/arXiv_pdf_0005_002.tar
...

Get all files under pdf:

$ s3cmd get --requester-pays s3://arxiv/pdf/\*

Save the full listing of src to a text file:

$ s3cmd ls --requester-pays s3://arxiv/src/\* > all_files.txt
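Once the listing is saved, it can also drive a per-file download instead of a single wildcard get. The following is a minimal sketch rather than part of the original workflow: the function names `list_urls` and `download_listed` are hypothetical, and it assumes that the S3 URL is the fourth whitespace-separated field of each `s3cmd ls` line and that object names contain no spaces. `--skip-existing` is passed so that a rerun skips files already on disk.

```shell
# Print the S3 URL (fourth field) of every object line in an `s3cmd ls` listing.
# DIR lines carry no numeric size in field 3, so they are skipped.
list_urls() {
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^s3:\/\// { print $4 }' "$1"
}

# Download each listed object into a destination directory, one at a time.
# Assumption: object names contain no whitespace.
download_listed() {
    listing=$1   # file produced by `s3cmd ls ... > all_files.txt`
    dest=$2      # local directory to download into
    mkdir -p "$dest"
    list_urls "$listing" | while read -r url; do
        s3cmd get --requester-pays --skip-existing "$url" "$dest/"
    done
}
```

For example, `download_listed all_files.txt /mnt/bigephemeral` would fetch every object in the listing into /mnt/bigephemeral.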

Calculate the total size (in GB) and the average file size (in bytes):

$ awk '{s += $3} END { print "sum is", s/1000000000, "GB, average is", s/NR }' all_files.txt
sum is 844.626 GB, average is 4.80447e+08
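The one-liner above divides by NR, which counts every line in the file, including any DIR entries, and it reports the average in raw bytes. A slightly more defensive variant might look like the sketch below. The function name `summarize_listing` is hypothetical, and it assumes the size in bytes is the third whitespace-separated field of each object line.

```shell
# Summarize an `s3cmd ls` listing: object count, total size, mean size.
# Only lines whose third field is a plain integer are counted, so
# DIR entries and blank lines do not distort the average.
summarize_listing() {
    LC_ALL=C awk '
        $3 ~ /^[0-9]+$/ { total += $3; n++ }
        END {
            if (n == 0) { print "no objects found"; exit 1 }
            printf "%d objects, %.3f GB total, %.1f MB average\n",
                   n, total / 1e9, total / n / 1e6
        }' "$1"
}
```

It is called the same way as the one-liner: `summarize_listing all_files.txt`.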

Install Tsunami UDP on CentOS 7

Install dependencies:

$ yum install cvs git gcc automake autoconf libtool -y

Download Tsunami UDP:

$ cd /tmp
$ cvs -z3 -d:pserver:anonymous@tsunami-udp.cvs.sourceforge.net:/cvsroot/tsunami-udp co -P tsunami-udp
$ cd tsunami-udp
$ ./recompile.sh
$ make install

Then on the server side:

$ tsunamid --port 46224 * # (Serves all files from current directory for copy)

On the client side:

$ tsunami connect <server_ip> get *

Transfer dataset back to S3:

$ aws s3 cp --recursive /mnt/bigephemeral s3://<your-new-bucket>/

Limitations:

  • Tsunami UDP transfers only individual files, not directories or subdirectories, so a directory tree must be packed into a single tar file first (plan for the extra storage this requires).
  • Multi-threading is not supported.
  • Multiple sessions are not supported: the client holds only one connection to the server at a time, so files cannot be transferred in parallel.
  • File transfers cannot be resumed or retried.
  • Native encryption is not supported.
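
Two of these limitations can be worked around from the shell: packing a directory tree into one tar file before sending it, and rerunning a failed transfer by hand. The sketch below is one possible approach, not something Tsunami UDP provides. The function names `pack_dir`, `verify_archive`, and `retry` and the `RETRY_DELAY` variable are hypothetical, and it assumes GNU tar and sha256sum are available (as they are on a stock CentOS 7 install).

```shell
# Pack a directory into a single tar file and record its SHA-256 checksum,
# so the tree can be sent as one file and verified on the receiving side.
pack_dir() {
    src_dir=$1      # directory to pack
    archive=$2      # path of the tar file to create
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
    # Run sha256sum from the archive's directory so the checksum file
    # stores a bare file name and can be checked from any location.
    (cd "$(dirname "$archive")" &&
        sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256")
}

# Verify a received archive against its .sha256 file (same directory).
verify_archive() {
    archive=$1
    (cd "$(dirname "$archive")" && sha256sum -c "$(basename "$archive").sha256")
}

# Rerun a command until it succeeds or the attempt limit is reached.
# RETRY_DELAY (seconds between attempts) is a hypothetical knob, default 5.
retry() {
    max_attempts=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}
```

On the server, something like `pack_dir /mnt/bigephemeral /tmp/dataset.tar` (an example path) would produce the archive and its checksum file. On the client, `retry 3 tsunami connect <server_ip> get dataset.tar` followed by `verify_archive dataset.tar` would repeat the transfer up to three times and then check the result. This relies on the assumption that the tsunami client exits with a non-zero status when a transfer fails, and each retry restarts the file from the beginning.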

Refs: