Monday, June 27, 2016

Verify MD5 of a million files with Spark

Partitioning will not be helpful in all applications—for example, if a given RDD is scanned only once, there is no point in partitioning it in advance. It is useful only when a dataset is reused multiple times in key-oriented operations such as joins.

Spark’s partitioning is available on all RDDs of key/value pairs, and causes the system to group elements based on a function of each key. Although Spark does not give explicit control of which worker node each key goes to (partly because the system is designed to work even if specific nodes fail), it lets the program ensure that a set of keys will appear together on some node.

Many of Spark’s operations involve shuffling data by key across the network. All of these will benefit from partitioning. For operations that act on a single RDD, such as reduceByKey(), running on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master.
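For example (a minimal sketch, assuming a hypothetical pair RDD of events keyed by user ID), partitioning once with partitionBy() and persisting the result lets later key-oriented operations reuse that layout instead of reshuffling:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Hypothetical input: one tab-separated event per line, first field is the user ID.
val events = sc.textFile("/tmp/events.tsv")
  .map(line => (line.split("\t")(0), 1))

// Shuffle once into 100 hash partitions and keep the result in memory.
val partitioned = events
  .partitionBy(new HashPartitioner(100))
  .persist(StorageLevel.MEMORY_ONLY)

// reduceByKey() on the pre-partitioned RDD combines each key's values locally;
// only the locally reduced values are sent back for the final result.
val countsPerUser = partitioned.reduceByKey(_ + _)
countsPerUser.take(5).foreach(println)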





One issue to watch out for when passing functions is inadvertently serializing the
object containing the function. When you pass a function that is the member of an
object, or contains references to fields in an object (e.g., self.field), Spark sends the
entire object to worker nodes, which can be much larger than the bit of information
you need (see Example 3-19). Sometimes this can also cause your program to fail, if
your class contains objects that Python can’t figure out how to pickle.
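The same pitfall exists in the Scala API used elsewhere in these notes; a minimal sketch (the class and field names are made up for illustration):

class WordFilter(val keyword: String) extends Serializable {
  // Bad: referencing the field captures `this`, so the whole WordFilter object
  // is serialized and shipped with every task.
  def findBad(rdd: org.apache.spark.rdd.RDD[String]) =
    rdd.filter(line => line.contains(keyword))

  // Better: copy the field into a local val so only that small String is serialized.
  def findGood(rdd: org.apache.spark.rdd.RDD[String]) = {
    val kw = keyword
    rdd.filter(line => line.contains(kw))
  }
}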
The set of stages produced for a particular action is termed a job. In each case when
we invoke actions such as count(), we are creating a job composed of one or more
stages.
Once the stage graph is defined, tasks are created and dispatched to an internal
scheduler, which varies depending on the deployment mode being used. Stages in the
physical plan can depend on each other, based on the RDD lineage, so they will be
executed in a specific order. For instance, a stage that outputs shuffle data must occur
before one that relies on that data being present.
A physical stage will launch tasks that each do the same thing but on specific partitions
of data. Each task internally performs the same steps:
1. Fetching its input, either from data storage (if the RDD is an input RDD), an
existing RDD (if the stage is based on already cached data), or shuffle outputs.
2. Performing the operation necessary to compute RDD(s) that it represents. For
instance, executing filter() or map() functions on the input data, or performing
grouping or reduction.
3. Writing output to a shuffle, to external storage, or back to the driver (if it is the
final RDD of an action such as count()).
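A minimal Scala sketch of the above (the input path is hypothetical): a shuffle operation such as reduceByKey() introduces a stage boundary, and calling count() submits one job made of those stages; toDebugString prints the lineage the stages are built from.

val words = sc.textFile("/tmp/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))

val counts = words.reduceByKey(_ + _)   // shuffle boundary => a second stage

println(counts.toDebugString)           // show the RDD lineage behind the stages
counts.count()                          // action: submits one job with two stages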



Choosing an output compression codec can have a big impact on future users of the
data. With distributed systems such as Spark, we normally try to read our data in
from multiple different machines. To make this possible, each worker needs to be
able to find the start of a new record. Some compression formats make this impossible,
which requires a single node to read in all of the data and thus can easily lead to a
bottleneck. Formats that can be easily read from multiple machines are called “splittable.”
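For example, when saving text output from Spark you can choose the codec explicitly. A minimal sketch (output paths are hypothetical), using Hadoop's BZip2Codec, which is splittable, versus GzipCodec, which is not:

import org.apache.hadoop.io.compress.{BZip2Codec, GzipCodec}

val records = sc.parallelize(Seq("a,1", "b,2", "c,3"))

// bzip2 output can later be re-read in parallel splits...
records.saveAsTextFile("/tmp/out-bzip2", classOf[BZip2Codec])

// ...while gzip output forces a single reader per file.
records.saveAsTextFile("/tmp/out-gzip", classOf[GzipCodec])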



While Spark’s textFile() method can handle compressed input, it automatically disables splitting even if the input was compressed in a way that could be read splittably. If you find yourself needing to read in a large, single-file compressed input, consider skipping Spark’s wrapper and instead use either newAPIHadoopFile or hadoopFile and specify the correct compression codec.
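A hedged sketch of that approach (the input path is made up, and LzoTextInputFormat comes from the separate hadoop-lzo library, which must be on the classpath):

import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat   // from the hadoop-lzo package

// Read an indexed .lzo file through the Hadoop input format so it can be split
// across workers, instead of letting textFile() fall back to a single split.
val lines = sc.newAPIHadoopFile(
    "/data/big.lzo",
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, text) => text.toString }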




val sourceMD5 = sc.textFile("/home/ano/source.md5")
val destMD5 = sc.textFile("/home/ano/oss.md5")
// Assuming md5sum-style lines ("<md5>  <path>", two spaces), index 0 is the hash and index 2 is the path;
// the source paths have "oss1" replaced with "osstest" so they match the destination listing.
val source_pairs = sourceMD5.map(x => (x.split(" ")(0), x.split(" ")(2).replace("oss1","osstest")))
val dest_pairs = destMD5.map(x => (x.split(" ")(0), x.split(" ")(2)))
val invalids = source_pairs.subtract(dest_pairs).count
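If you also want to see which files failed verification rather than only the count, a small hedged extension of the code above (the output path is made up):

// Keep the mismatched (md5, path) pairs and write them out for inspection.
source_pairs.subtract(dest_pairs)
  .map { case (md5, path) => md5 + " " + path }
  .saveAsTextFile("/home/ano/md5_mismatches")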
Each machine ran at full load; the whole run took three days for 1.8 TB of data.
Raw data size × 2.5 × 2 (FASTQ).
To benchmark the standalone (single-machine) version: 1. put a reasonably large file into memory and stop the standalone service; 2. configure job.cfg and delete the jobName setting; 3. then copy it into multiple jobs with seq 1 100 | xargs -I {} echo "cp job.cfg job.{}.cfg;echo jobName=local_test.{} >> job.{}.cfg" | bash; 4. submit those jobs; 5. start the service and see what steady-state speed it reaches.
seq 1 200 | xargs -I {} echo "cp job.cfg job.{}.cfg;echo jobName=local_test.{} >> job.{}.cfg;echo destPrefix=vpn{}/ >> job.{}.cfg;" | bash
seq 1 100 | xargs -I {} java  -jar bin/ossimport2.jar  -c conf/sys.properties submit jobs/job.{}.cfg
seq 1 100 | xargs -I {} java  -jar bin/ossimport2.jar  -c conf/sys.properties clean jobs/job.{}.cfg

Wednesday, June 22, 2016

network Tuning

For a host with a 10G NIC, optimized for network paths up to 200ms RTT and for friendliness to single- and parallel-stream tools, or a 40G NIC on paths up to 50ms RTT:

# allow testing with buffers up to 128MB
net.core.rmem_max = 134217728 
net.core.wmem_max = 134217728 
# increase Linux autotuning TCP buffer limit to 64MB
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# increase the length of the processor input queue
net.core.netdev_max_backlog = 250000
# recommended default congestion control is htcp 
net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
Also add this to /etc/rc.local (where N is the number for your 10G NIC): 
    /sbin/ifconfig ethN txqueuelen 10000
    ifconfig ethX txqueuelen 300000
   ethtool -K eth3 gso on
   ethtool -k eth3


ip link show eth3


tuning NIC


You can use ethtool to check on the number of descriptors your NIC has, and whether the driver is configured to use them.  Some sample Linux output from an Intel 10GE NIC that is using the default config is below:
[user@perfsonar ~]# ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX:          4096
RX Mini:     0
RX Jumbo:    0
TX:          4096
Current hardware settings:
RX:          256
RX Mini:     0
RX Jumbo:    0
TX:          256

Under Linux, to check which driver you are using, do this:
  ethtool -i eth0




If you're using e1000 chips (Intel 1GE, often integrated into motherboards; note that this does not apply to newer variants) the driver defaults to 256 Rx and 256 Tx descriptors. This is because early versions of the chipset only supported this. All recent versions have 4096, but the driver doesn't autodetect this. Increasing the number of descriptors can improve performance dramatically on some hosts.
You can also add this to /etc/rc.local to get the same result.

   ethtool -G ethN rx 4096 tx 4096



[1]https://fasterdata.es.net/host-tuning/linux/

Wednesday, June 15, 2016

Linux File System Read Write Performance Test

sudo apt-get install pktstat
sudo pktstat -i eth0 -nt
dstat -nt
dstat -N eth2,eth3
sudo apt-get install nethogs
sudo nethogs
http://www.cyberciti.biz/faq/fedora-sl-centos-redhat6-enable-epel-repo/
$ cd /tmp
wget http://mirror-fpt-telecom.fpt.net/fedora/epel/6/i386/epel-release-6-8.noarch.rpm
# rpm -ivh epel-release-6-8.noarch.rpm

How do I use EPEL repo?

Simply use the yum commands to search for or install packages from the EPEL repo:
# yum search nethogs
# yum update
# yum --disablerepo="*" --enablerepo="epel" install nethogs










System administrators responsible for Linux servers sometimes get confused when asked to benchmark a file system's performance. The main reason for the confusion is that it does not matter which tool you use to test the file system's performance; what matters is the exact requirement. A file system's performance depends on certain factors, as follows.

The maximum rotational speed of your hard disk
The allocated block size of a file system
Seek time
The performance rate of the file system's metadata
The type of read/write
Seriously speaking, it is wonderful to realize that various technologies made by different people and even different companies work together in coordination inside a single box, and we call that box a computer. It is even more wonderful to realize that hard disks store almost all of the information available in the world in digital format. It is a complex thing to understand how hard disks really store our data safely. Explaining the different aspects of how a hard disk, and a file system on top of it, work together is beyond the scope of this article (but I will surely give it a try in a couple of future posts).

So let's begin our tutorial on file system benchmark testing.

It is advised that you do not run any other disk-I/O-intensive tasks during this file system performance test; otherwise your performance results will be heavily skewed. It is better to stop all other such processes during the test.

The Simplest Performance Test Using dd command

The simplest read/write performance test in Linux can be done with the help of the dd command. This command is used to write to or read from any block device in Linux, and you can do a lot of stuff with it. The main plus point of this command is that it is readily available in almost all distributions out of the box, and it is pretty easy to use.

With the dd command we will only be testing sequential read and sequential write. I will test the speed of my partition /dev/sda1, which is mounted on "/" (the only partition I have on my system), so I can write the test data anywhere in my filesystem.

[root@slashroot2 ~]# dd if=/dev/zero of=speetest bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0897865 seconds, 1.2 GB/s
[root@slashroot2 ~]
In the above command you will be amazed to see that you got 1.2 GB/s. But don't be happy; that figure is misleading, because the speed dd reported is the speed at which the data was cached in RAM, not the speed at which it was written to disk. So we need to ask the dd command to report the speed only after the data is synced with the disk. For that we need to run the command below.

[root@slashroot2 ~]# dd if=/dev/zero of=speetest bs=1M count=100 conv=fdatasync
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 2.05887 seconds, 50.9 MB/s
As you can clearly see, with the conv=fdatasync option the dd command reports the rate only after the data is completely written to the disk, so now we have the actual sequential write speed. Let's now go to a larger amount of data: 200 MB written in 64 KB blocks.

[root@slashroot2 ~]# dd if=/dev/zero of=speedtest bs=64k count=3200 conv=fdatasync
3200+0 records in
3200+0 records out
209715200 bytes (210 MB) copied, 3.51895 seconds, 59.6 MB/s


As you can clearly see, the speed came to about 59 MB/s. Note that if you do not specify a block size, ext3 by default gets formatted with a block size determined by programs like mke2fs. You can verify yours with the following commands.

tune2fs -l /dev/sda1

dumpe2fs /dev/sda1

To test the sequential read speed with the dd command, run the command below.

[root@myvm1 sarath]# dd if=speedtest of=/dev/null bs=64k count=24000
5200+0 records in
5200+0 records out
340787200 bytes (341 MB) copied, 3.42937 seconds, 99.4 MB/s
Performance Test using HDPARM

Now let's use a tool other than dd for our tests. We will start with the hdparm command to test the speed. The hdparm tool is also available out of the box in most Linux distributions.

[root@myvm1 ~]# hdparm -tT /dev/sda1

/dev/sda1:
 Timing cached reads:   5808 MB in  2.00 seconds = 2908.32 MB/sec
 Timing buffered disk reads:   10 MB in  3.12 seconds =   3.21 MB/sec


There are multiple things to understand in the above hdparm results. The -T option shows the speed of reads served from the cache/buffer (that is why it is so much higher).

The -t option shows the speed of buffered disk reads without previously cached data (which in the above output is a low 3.21 MB/sec).

The hdparm output shows both the cached reads and the disk reads separately. As mentioned before, hard disk seek time also matters a lot for speed; seek time is the time required by the hard disk to reach the sector where the data is stored. You can check your hard disk's seek time with the seeker tool, using the simple command below.

[root@slashroot2 ~]# seeker /dev/sda1
Seeker v3.0, 2009-06-17, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda1 [81915372 blocks, 41940670464 bytes, 39 GB, 39997 MB, 41 GiB, 41940 MiB]
[512 logical sector size, 512 physical sector size]
[1 threads]
Wait 30 seconds..............................
Results: 87 seeks/second, 11.424 ms random access time (26606211 < offsets < 41937280284)
[root@slashroot2 ~]#
It is clearly shown that my disk did 87 seeks per second for sectors containing data. That is OK for a desktop Linux machine, but for servers it is not at all OK.

Read Write Benchmark Test using IOZONE:

Now there is one tool in Linux that will do all of these tests in one shot: none other than IOZONE. We will run some benchmark tests against my /dev/sda1 with the help of iozone. Computers and servers are always purchased with some purpose in mind: some servers need to be high-end performance-wise, some need to be fast at sequential reads, and others are ordered with random reads in mind. IOZONE is very helpful for carrying out a large number of performance benchmark tests against a drive. The output produced by iozone is quite extensive.

The default command line option -a is used for full automatic mode, in which iozone will test block sizes ranging from 4k to 16M and file sizes ranging from 64k to 512M. Let's do a test using this -a option and see what happens.

[root@myvm1 ~]# iozone -a /dev/sda1
             Auto Mode
        Command line used: iozone -a /dev/sda1
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
<div id="xdvp"><a href="http://www.ecocertico.com/no-credit-check-direct-lenders&#10;">creditors you never heard</a></div>
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
              64       4  172945  581241  1186518  1563640  877647  374157  484928   240642   985893   633901   652867 1017433  1450619
              64       8   25549  345725   516034  2199541 1229452  338782  415666   470666  1393409   799055   753110 1335973  2071017
              64      16   68231  810152  1887586  2559717 1562320  791144 1309119   222313  1421031   790115   538032  694760  2462048
              64      32  338417  799198  1884189  2898148 1733988  864568 1421505   771741  1734912  1085107  1332240 1644921  2788472
              64      64   31775  811096  1999576  3202752 1832347  385702 1421148   771134  1733146   864224   942626 2006627  3057595
             128       4  269540  699126  1318194  1525916  390257  407760  790259   154585   649980   680625   684461 1254971  1487281
             128       8  284495  837250  1941107  2289303 1420662  779975  825344   558859  1505947   815392   618235  969958  2130559
             128      16  277078  482933  1112790  2559604 1505182  630556 1560617   624143  1880886   954878   962868 1682473  2464581
             128      32  254925  646594  1999671  2845290 2100561  554291 1581773   723415  2095628  1057335  1049712 2061550  2850336
             128      64  182344  871319  2412939   609440 2249929  941090 1827150  1007712  2249754  1113206  1578345 2132336  3052578
             128     128  301873  595485  2788953  2555042 2131042  963078  762218   494164  1937294   564075  1016490 2067590  2559306
         

Note: all of the output you see above is in KB/sec.

The first column shows the file size used and the second column shows the record length used.

Let's understand the output in some of the columns.

The third column, Write: this shows the speed of writing a new file. Whenever a new file is created in a file system under Linux, there is extra overhead involved in storing its metadata, for example the inode for the file and its entry in the journal. So creating a new file in a file system is always comparatively slower than overwriting an already created file.

The fourth column, Re-write: this shows the speed reported when overwriting a file which has already been created.

The fifth column, Read: this reports the speed of reading an already existing file.











http://www.slashroot.in/linux-file-system-read-write-performance-test

Sunday, June 5, 2016

SGE

SGE JOB history
/var/lib/gridengine/default/common$ tail -f accounting



qsub sge-root/examples/jobs/simple.sh
After the job finishes executing, check your home directory for the redirected stdout/stderr files script-name.e<job-id> and script-name.o<job-id>.
The job-id is a consecutive unique integer assigned to each job.


/usr/lib/lsb/remove_initd /etc/init.d/sgeexecd.cluster

./install_qmaster -m -noremote -auto util/install_modules/sge_configuration.conf

Default location of Sun Grid Engine log files:
<qmaster_spool_dir>/messages
<qmaster_spool_dir>/schedd/messages
<execd_spool_dir>/<hostname>/messages
<sge_root>/<sge_cell>/common/accounting
<sge_root>/<sge_cell>/common/statistics
add user to submit queue
$qconf -sul
$qconf -su ceshi
name    ceshi
type    ACL DEPT
fshare  0
oticket 0
entries ceshi,ano

$qconf -as master.local

$qconf -shgrp @ceshi >hosts4group
add an entry to hosts4group
$qconf -Mhgrp hosts4group
$qconf -se execd2 >execd.txt
modify execd.txt
$qconf -Me execd.txt

$qconf -sq ceshi.q > queue.txt
$qconf -Mq queue.txt



Adding an execution host
  • Make the new host an administrative host

    qconf -ah <hostname>
  • As root on this new host, run the following script from $SGE_ROOT

    install_execd

Removing an execution host
  • First, delete the queues associated with this host
  •     qconf -sql  -> shows a list of all queues

          The following operations are all performed on the management (master) node.
    1. qconf -mq ceshi.q  (remove the entries for the node being deleted)

    3. Remove the node being deleted from /etc/hosts
  • Delete the host
    qconf -mhgrp @ceshi    to remove entry for <hostname>
    qconf -de <hostname>
     qconf -dh <host>
  • Finally, delete the configuration for the host

    qconf -dconf <hostname>








Install On Master

sudo apt-get install gridengine-client gridengine-common gridengine-master gridengine-qmon sun-java6-jre
  1. Java is needed for it to run.
  2. In the installer, say no to sending email.
  3. Set the cell name to default.
  4. It is important to set the master name to your actual outside name. To find it, type
hostname -I
#this gives you an IP, then type
host thisIP
Then give the fully qualified name of your machine as host.

Install On Workers

sudo apt-get install gridengine-exec gridengine-client

Starting on Master

Try starting it inside a screen session with
sudo su
sge_qmaster
sge_execd
qmon
It will probably fail with some message like
$ sudo qmon
Warning: Cannot convert string "-adobe-courier-medium-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-bold-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-medium-r-*--12-*-*-*-m-*-*-*" to type FontStruct
X Error of failed request: BadName (named color or font does not exist)
Major opcode of failed request: 45 (X_OpenFont)
Serial number of failed request: 643
Current serial number in output stream: 654
Then install fonts:
sudo apt-get install xfonts-base xfonts-100dpi xfonts-75dpi
Then restart computer (no really!)
If you run into problems that say you cannot connect, make sure you don't have another sge process already running and that you have the fully qualified hostname set correctly
To kill possibly running instances:
ps aux | grep "sge"
To reconfigure the already installed packages after changing your hostname (for Ubuntu, in /etc/hostname) to match the one you got from running the above host command:
sudo dpkg-reconfigure gridengine-master
# or to purge/remove with config files the whole install and try over clean:
sudo apt-get purge gridengine-common gridengine-exec gridengine-client
Then follow part 2 of http://scidom.wordpress.com/tag/parallel/ to use the GUI to set up your mainqueue and try their simple hello world example.
Other useful resources are: http://helms-deep.cable.nu/~rwh/blog/?p=159



sudo apt-get install gridengine-client gridengine-common gridengine-master

Creating config file /etc/default/gridengine with new version
Setting up gridengine-client (6.2u5-7.3) ...
Setting up gridengine-master (6.2u5-7.3) ...
Initializing cluster with the following parameters:
 => SGE_ROOT: /var/lib/gridengine
 => SGE_CELL: default
 => Spool directory: /var/spool/gridengine/spooldb
 => Initial manager user: sgeadmin
Initializing spool (/var/spool/gridengine/spooldb)
Initializing global configuration based on /usr/share/gridengine/default-configuration
Initializing complexes based on /usr/share/gridengine/centry
Initializing usersets based on /usr/share/gridengine/usersets
Adding user sgeadmin as a manager
Cluster creation complete

Friday, June 3, 2016

bcl2fastq v2.15 on ubuntu 14.04

Demultiplexing 

Demultiplexing =  reorganizing the FASTQ files +  generating the statistics and reporting files.

Reorganizing FASTQ Files

The first step of demultiplexing is reorganizing the base call files, based on the index
sequence. This step is done the following way for each cluster:
1 Get the raw index for each Index Read from the BCL file.
2 Identify the appropriate sample for the index based on the sample sheet.
3 Optional: Detect and correct up to two errors on the barcode, and identify the
appropriate sample. If there are multiple Index Reads, detect and correct up to two
errors in each Index Read.
4 Optional: Detect the presence of adapter sequence at the end of read. If adapter
sequence is detected, trim or mask (with N) the corresponding base calls.
5 Append the read to the appropriate new FASTQ file for each read.
6 If the index cannot be identified, the data are written into an Undetermined sample
file, unless the sample sheet specifies a sample for reads without an index.

Compiling bcl2fastq v2.15 on Ubuntu 12.04 and 14.04

Wed 27 August 2014 — Filed under notes; tags: linux
Illumina provides a program for demultiplexing sequencing output called bcl2fastq. They get a gold star for releasing the source - the downside is that they release binaries only for RHEL/CentOS, and no build instructions for Ubuntu. So how hard could it be?

Ubuntu 14.04 (Trusty Tahr)

I thought I'd start here since the packages are more up to date (turns out it's a good thing I did, see the morass below). There's some documentation from Illumina for compiling from source here. There's not a lot to go on, other than a list of dependencies, which boils down to:
  • zlib
  • librt
  • libpthread
  • gcc 4.1.2 (with c++)
  • boost 1.54 (with its dependencies)
  • cmake 2.8.9
Really the only tricky part was figuring out the required packages, which didn't correspond particularly well to the list of dependencies above. I didn't bother trying to install specific versions of any of the dependencies other than boost 1.54.
On an Amazon AWS EC2 instance (m3.medium, ubuntu-trusty-14.04-amd64-server-20140607.1 ami-e7b8c0d7):
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install zlibc
sudo apt-get install libc6 # provides librt and libpthread
sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install libboost1.54-all-dev
sudo apt-get install cmake
From there, compilation more or less works as advertised:
wget ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/Software/bcl2fastq/bcl2fastq2-v2.15.0.4.tar.gz
tar -xf bcl2fastq2-v2.15.0.4.tar.gz
cd bcl2fastq
mkdir build
cd build
PREFIX=/usr/local
sudo mkdir -p ${PREFIX:?}
../src/configure --prefix=${PREFIX:?}
make
sudo make install
We wanted this version to coexist with an older one, so I renamed the executable:
sudo mv $PREFIX/bin/bcl2fastq $PREFIX/bin/bcl2fastq2
syntax error at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm line 761, near "}" /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment/Config.pm has too many errors. Compilation failed in require at /usr/local/lib/bcl2fastq-1.8.4/perl/Casava/Alignment.pm line 61.
sudo apt-get install libexpat1-dev
sudo apt-get install xsltproc
The reason for the errors is that bcl2fastq is not compatible with the default perl 5.18 of Ubuntu 14.04. You need to install an older perl version to execute the script. Use the following commands to install, e.g. 5.14, to path/perlbrew/:
cd path/perlbrew/
wget http://install.perlbrew.pl -O install_perlbrew.sh
export PERLBREW_ROOT=path/perlbrew/ && bash install_perlbrew.sh
source ./etc/bashrc
perlbrew install perl-5.14.4
perlbrew switch perl-5.14.4
perlbrew install-cpanm
cpanm XML/Simple.pm
http://nhoffman.github.io/borborygmi/compiling-bcl2fastq-on-ubuntu.html

Thursday, June 2, 2016

s3fs-fuse

Maximum file size=64GB

s3fs is stable and is being used in a number of production environments, e.g., rsync backup to S3.
s3fs works with rsync! (as of svn 43). As of r152, s3fs uses x-amz-copy-source for efficient updates of mode, mtime and uid/gid.


enable_content_md5 (default is disable)

verifying uploaded data without multipart by content-md5 header.

fusermount -uz  /opt/oss1



$ ossfs anno-sge /opt/oss1 -ourl=http://vpc100-oss-cn-beijing.aliyuncs.com -o multireq_max=5,use_cache=/mnt/xvdb1/tmp
$ ossfs anno-sge /opt/oss1 -ourl=http://vpc100-oss-cn-beijing.aliyuncs.com -o nomultipart,use_cache=/mnt/xvdb1/tmp


s3fs has a caching mechanism: you can enable local file caching to minimize downloads. The folder specified by use_cache (optional) holds a local file cache that is automatically maintained by s3fs; enable it with the use_cache option, e.g., -ouse_cache=/mnt/xvdb1/tmp.


s3fs supports multipart requests (it sends some requests in parallel); this problem likely depends on the number of parallel requests. If you can, please try setting small values for the multireq_max and parallel_count options.


  • nomultipart
    • disable multipart uploads.
  • multireq_max (default="500")
    • maximum number of parallel request for listing objects.
  • parallel_count (default="5")
    • number of parallel request for downloading/uploading large objects. s3fs uploads large object(over 20MB) by multipart post request, and sends parallel requests. This option limits parallel request count which s3fs requests at once.


https://github.com/s3fs-fuse/s3fs-fuse/issues/94

https://github.com/s3fs-fuse/s3fs-fuse/issues/152




$cat s3fs-watchdog.sh

#!/bin/bash
#
# s3fs-watchdog.sh
#
# Run from the root user's crontab to keep an eye on s3fs which should always
# be mounted.
#
# Note:  If getting the amazon S3 credentials from environment variables
#   these must be entered in the actual crontab file (otherwise use one
#   of the s3fs other ways of getting credentials).
#
# Example:  To run it once every minute getting credentials from environment
# variables enter this via "sudo crontab -e":
#
#   AWSACCESSKEYID=XXXXXXXXXXXXXX
#   AWSSECRETACCESSKEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#   * * * * * /root/s3fs-watchdog.sh
#

NAME=ossfs
BUCKET=anno-sge
MOUNTPATH=/opt/oss1
MOUNT=/bin/mount
UMOUNT=/bin/umount
NOTIFY=whg@anno.com
NOTIFYCC=whg@anno.com
GREP=/bin/grep
PS=/bin/ps
NOP=/bin/true
DATE=/bin/date
MAIL=/usr/bin/mail
RM=/bin/rm

$PS -ef|$GREP -v grep|$GREP $NAME|grep $BUCKET >/dev/null 2>&1
case "$?" in
   0)
   # It is running in this case so we do nothing.
   $NOP
   ;;
   1)
   echo "$NAME is NOT RUNNING for bucket $BUCKET. Remounting $BUCKET with $NAME and sending notices."
   $UMOUNT $MOUNTPATH >/dev/null 2>&1
   $MOUNT $MOUNTPATH >/tmp/watchdogmount.out 2>&1
   NOTICE=/tmp/watchdog.txt
   echo "$NAME for $BUCKET was not running and was started on `$DATE`" > $NOTICE
   $MAIL -n -s "$BUCKET $NAME mount point lost and remounted" -t $NOTIFYCC $NOTIFY < $NOTICE
   $RM -f $NOTICE
   ;;
esac

exit

$cat /etc/fstab

ossfs#anno-sge  /opt/oss1 fuse _netdev,url=http://vpc100-oss-cn-beijing.aliyuncs.com,uid=1001,gid=1001,max_stat_cache_size=100000000,nomultipart,use_cache=/mnt/xvdb1/tmp,allow_other,user,exec  0 0

Make sure the file /etc/passwd-ossfs exists with permission 640, and that the user and the file's owner are in the same group.

$mount /opt/oss1
fusermount: failed to open /etc/fuse.conf: Permission denied

fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf

The problem can easily be fixed by adding the user to the fuse group and then logging in again:

sudo addgroup <username> fuse