Cloud 1 : A Performance Goal Oriented Processor
Allocation Technique for Centralized Heterogeneous Multi-cluster Environments
This paper proposes a processor allocation technique named temporal look-ahead processor allocation (TLPA) that makes allocation decisions by evaluating their effects on subsequent jobs in the waiting queue. TLPA has two strengths. First, it takes multiple performance factors into account when making allocation decisions. Second, it can be used to optimize different performance metrics. To evaluate TLPA, we compare it with the best-fit and fastest-first algorithms. Simulation results show that TLPA improves average turnaround time by up to 32.75% over conventional processor allocation algorithms in various system configurations.
Cloud
2 : Addressing Resource Fragmentation in Grids through
Network-Aware Meta-scheduling in Advance
Grids are made of heterogeneous, geographically dispersed computing resources, where providing Quality of Service (QoS) is a challenging task. One way of enhancing the QoS perceived by users is to schedule jobs in advance, since reservations of resources are not always possible. This way, it becomes more likely that the appropriate resources are available to run the job when needed. One drawback of this scenario is that fragmentation, a well-known effect when allocating jobs to resources, appears and becomes the cause of poor resource utilization. A new technique has therefore been developed to tackle fragmentation problems, which consists of rescheduling already scheduled tasks. To this end, some heuristics are implemented to calculate the intervals to be replanned and to select the jobs involved in the process. Moreover, another heuristic is implemented to put rescheduled jobs as close together as possible to minimize the fragmentation. This technique has been tested using a real test bed.
Cloud
3 : APP: Minimizing Interference Using Aggressive
Pipelined Prefetching in Multi-level Buffer Caches
As services become more complex with multiple interactions, and storage servers are shared by multiple services, the different I/O streams arising from these services compete for disk attention. Storage clients enabled with Aggressive Pipelined Prefetching (APP) are designed to manage the buffer cache and I/O streams to minimize the disk I/O interference arising from competing streams. Due to the large number of streams serviced by a storage server, most of the disk time is spent seeking, leading to degradation in response times. The goal of APP is to decrease application execution time by increasing the throughput of individual I/O streams and utilizing idle capacity on remote nodes along with idle network times, thus effectively avoiding alternating bursts of activity followed by periods of inactivity. APP significantly increases overall I/O throughput and decreases overall messaging overhead between servers. In APP, the intelligence is embedded in the clients, which automatically infer parameters in order to achieve the maximum throughput. APP clients make use of aggressive prefetching and data offloading to remote buffer caches in multi-level buffer cache hierarchies in an effort to minimize disk interference and temper the side effects of aggressive prefetching. We used an extremely I/O-intensive Radix-k application, employed in studies on the scalability of parallel image composition and particle tracing developed at the Argonne National Laboratory, with data sets of up to 128 GB, and implemented our scheme on a 16-node Linux cluster. We observed that the execution time of the application decreased by 68% on average when using our scheme.
Cloud
4 : ASDF: An Autonomous and Scalable Distributed File
System
The demand for huge storage space in data-intensive applications and high-performance scientific computing continues to grow. Integrating massive distributed storage resources to provide huge storage space is an important and challenging issue in Cloud and Grid computing. In this paper, we propose a distributed file system, called ASDF, to meet the demands of not only data-intensive applications but also end users, developers and administrators. While sharing many of the same goals as previous distributed file systems, such as scalability, reliability, and performance, it is also designed with an emphasis on compatibility, extensibility and autonomy. With these design goals in mind, we address several issues and present our design, which adopts peer-to-peer technology, replication, multi-source data transfer, metadata caching and a service-oriented architecture. The experimental results show that the proposed distributed file system meets our design goals and will be useful in Cloud and Grid computing.
Cloud
5 : Assertion Based Parallel Debugging
Programming languages have advanced tremendously over the years, but program debuggers have hardly changed. Sequential debuggers do little more than allow a user to control the flow of a program and examine its state. Parallel ones support the same operations on multiple processes, which is adequate with a small number of processors but becomes unwieldy and ineffective on very large machines. Typical scientific codes have enormous multi-dimensional data structures, and it is impractical to expect a user to view the data using traditional display techniques. In this paper we discuss the use of debug-time assertions, and show that these can be used to debug parallel programs. The techniques reduce the debugging complexity because they reason about the state of large arrays without requiring the user to know the expected value of every element. Assertions can be expensive to evaluate, but their performance can be improved by running them in parallel. We demonstrate the system with a case study finding errors in a parallel version of the Shallow Water Equations, and evaluate the performance of the tool on a 4,096-core Cray XE6.
Cloud
6 : Autonomic SLA-Driven Provisioning for Cloud
Applications
Significant achievements have been made for automated allocation of cloud resources. However, the performance of applications may be poor in peak load periods, unless their cloud resources are dynamically adjusted. Moreover, although cloud resources dedicated to different applications are virtually isolated, performance fluctuations do occur because of resource sharing, and software or hardware failures (e.g. unstable virtual machines, power outages, etc.). In this paper, we propose a decentralized economic approach for dynamically adapting the cloud resources of various applications, so as to statistically meet their SLA performance and availability goals in the presence of varying loads or failures. According to our approach, the dynamic economic fitness of a Web service determines whether it is replicated or migrated to another server, or deleted. The economic fitness of a Web service depends on its individual performance constraints, its load, and the utilization of the resources where it resides. Cascading performance objectives are dynamically calculated for individual tasks in the application workflow according to the user requirements. By fully implementing our framework, we experimentally proved that our adaptive approach statistically meets the performance objectives under peak load periods or failures, as opposed to static resource settings.
Cloud
7 : BAR: An Efficient Data Locality Driven Task Scheduling
Algorithm for Cloud Computing
Large-scale data processing has become increasingly common in cloud computing systems like MapReduce, Hadoop, and Dryad in recent years. In these systems, files are split into many small blocks and all blocks are replicated over several servers. To process files efficiently, each job is divided into many tasks and each task is allocated to a server to deal with a file block. Because network bandwidth is a scarce resource in these systems, enhancing task data locality (placing tasks on servers that contain their input blocks) is crucial for the job completion time. Although there have been many approaches to improving data locality, most of them either are greedy and ignore global optimization, or suffer from high computational complexity. To address these problems, we propose a heuristic task scheduling algorithm called Balance-Reduce (BAR), in which an initial task allocation is produced first, and the job completion time is then reduced gradually by tuning the initial task allocation. By taking a global view, BAR can adjust data locality dynamically according to network state and cluster workload. The simulation results show that BAR is able to deal with large problem instances in a few seconds and outperforms previous related algorithms in terms of the job completion time.
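The two-phase idea described above (produce an initial locality-driven allocation, then tune it to shorten the completion time) can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the BAR algorithm as published: the function name `balance_reduce`, the uniform `local_cost` and `remote_cost` values, and the "move one task from the busiest to the idlest server" tuning rule are all hypothetical.

```python
def balance_reduce(replicas, servers, local_cost=1.0, remote_cost=3.0):
    """Two-phase scheduling sketch (hypothetical simplification).

    replicas: dict mapping task -> list of servers holding its input block.
    Returns (placement, makespan), where makespan is the largest server load.
    """
    load = {s: 0.0 for s in servers}
    placement = {}

    # Phase 1: fully local allocation, each task goes to the least-loaded
    # server that stores a replica of its input block.
    for task, holders in replicas.items():
        target = min(holders, key=lambda s: load[s])
        placement[task] = target
        load[target] += local_cost

    # Phase 2: tune the allocation. Move a task from the busiest server to
    # the idlest one whenever that lowers the busiest server's finish time,
    # even if the task then has to read its block remotely.
    while True:
        busiest = max(servers, key=lambda s: load[s])
        idlest = min(servers, key=lambda s: load[s])
        moved = False
        for task in [t for t, s in placement.items() if s == busiest]:
            old_cost = local_cost if busiest in replicas[task] else remote_cost
            new_cost = local_cost if idlest in replicas[task] else remote_cost
            if load[idlest] + new_cost < load[busiest]:
                placement[task] = idlest
                load[busiest] -= old_cost
                load[idlest] += new_cost
                moved = True
                break
        if not moved:
            return placement, max(load.values())
```

In this sketch a remote task is charged a fixed higher cost; a scheduler that reacts to network state, as the abstract describes, would presumably derive that cost from measured bandwidth instead.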
Cloud
8 : Characterizing the Performance of Parallel
Applications on Multi-socket Virtual Machines
In this paper we characterize the behavior with respect to memory locality management of scientific computing applications running in virtualized environments. NUMA locality on current solutions (KVM and Xen) is enforced by pinning virtual machines to CPUs and providing NUMA-aware allocation in hypervisors. Our analysis shows that due to two-level memory management and lack of integration with page reclamation mechanisms, applications running on warm VMs suffer from a "leakage" of page locality. Our results using MPI, UPC and OpenMP implementations of the NAS Parallel Benchmarks, running on Intel and AMD NUMA systems, indicate that applications observe an overall average performance degradation of 55% when compared to native. Runs on "cold" VMs suffer an average performance degradation of 27%, while subsequent runs are roughly 30% slower than the cold runs. We quantify the impact of locality improvement techniques designed for full virtualization environments: hypervisor-level page remapping and partitioning the NUMA domains between multiple virtual machines. Our analysis shows that hypervisor-only schemes have little or no potential for performance improvement. When the programming model allows it, system partitioning with proper VM and runtime support is able to reproduce native performance: in a partitioned system with one virtual machine per socket the average workload performance is 5% better than native.
Cloud
9 : Classification and Composition of QoS
Attributes in Distributed, Heterogeneous Systems
In large-scale distributed systems the selection of services and data sources to respond to a given request is a crucial task. Non-functional or Quality of Service (QoS) attributes need to be considered when there are several candidate services with identical functionality. Before applying any service selection optimization strategy, the system has to be analyzed in terms of QoS metrics, comparable to the statistics needed by a database query optimizer. This paper presents a classification approach for QoS attributes of system components, from which aggregation functions for composite services are derived. The applicability and usefulness of the approach are shown in a distributed system from a High-Energy Physics experiment posing a complex service selection challenge.
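To make the link between attribute classes and aggregation functions concrete, the following Python sketch composes the QoS of services invoked in sequence. The three classes used here (additive, multiplicative, bottleneck) are commonly used illustrative categories and are an assumption of this sketch, not the classification proposed in the paper; the names `AGGREGATORS` and `compose_sequential` are likewise hypothetical.

```python
from functools import reduce
import operator

# Hypothetical attribute classes, each mapped to its aggregation function
# for services executed one after another.
AGGREGATORS = {
    "additive": sum,                                            # e.g. latency
    "multiplicative": lambda v: reduce(operator.mul, v, 1.0),   # e.g. availability
    "bottleneck": min,                                          # e.g. throughput
}

def compose_sequential(services, attribute_classes):
    """Aggregate per-service QoS values into QoS of the composite service.

    services: list of dicts mapping attribute name -> value.
    attribute_classes: dict mapping attribute name -> class name.
    """
    return {
        attr: AGGREGATORS[cls]([svc[attr] for svc in services])
        for attr, cls in attribute_classes.items()
    }

classes = {
    "latency_ms": "additive",
    "availability": "multiplicative",
    "throughput": "bottleneck",
}
chain = [
    {"latency_ms": 10, "availability": 0.9, "throughput": 100},
    {"latency_ms": 20, "availability": 0.5, "throughput": 50},
]
composite = compose_sequential(chain, classes)
```

Once every attribute is assigned a class, the composite value follows mechanically, which is what lets a selection strategy compare candidate compositions much as a query optimizer compares plans.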
Cloud
10 : Cloud computing in Aircraft Data Network
The introduction of data networks within an aircraft has created several service opportunities for the air carriers. Using the available Internet connectivity, the carriers could offer services like Video-on-Demand (VoD), Voice-over-IP (VoIP), and gaming-on-demand within the aircraft. One of the major roadblocks in implementing any of these services is the additional hardware and software requirements. Each service requires dedicated hardware resources to run appropriate software components. It is not possible to accommodate every hardware component within the aircraft due to space, power, and ventilation restrictions. Also, it is not economically viable to install and maintain hardware components for every aircraft. One solution is to use cloud computing. Cloud computing is a recent innovation that is helping the computing industry in distributed computing. Cloud computing allows organizations to consolidate several hardware resources into one physical device. The cloud computing concept helps organizations reduce overall power consumption and maintenance costs. The cloud computing concept could be extended to the Aircraft Data Network environment, with every aircraft subscribing to the cloud resources to run its non-mission-critical applications. In this paper, the authors explore the possibility of using cloud services for Aircraft Data Networks. The authors evaluate the performance issues involved with aircraft mobility and dynamic resource transfer between servers when the aircraft's point-of-attachment changes. The authors predict that using cloud computing concepts would encourage many carriers to offer new services within the aircraft.
Cloud
11 : Dealing with Grid-Computing Authorization Using
Identity-Based Certificateless Proxy Signature
In this paper, we propose a new Identity-Based Certificateless Proxy Signature scheme for the grid environment, in order to enable attribute-based authorization, fine-grained delegation and enhanced delegation chain establishment and validation, all without relying on any kind of PKI certificates or proxy certificates. We show that our scheme is correct and secure. We also give an evaluation of the computational and communication overhead of the proposed scheme. Simulations show satisfactory results.
Cloud
12 : Enabling Multi-physics Coupled Simulations within the
PGAS Programming Framework
Complex coupled multi-physics simulations are playing increasingly important roles in scientific and engineering applications such as fusion plasma and climate modeling. At the same time, extreme scales, high levels of concurrency and the advent of multicore and many-core technologies are making the high-end parallel computing systems on which these simulations run hard to program. While the Partitioned Global Address Space (PGAS) languages attempt to address the problem, the PGAS model does not easily support the coupling of multiple application codes, which is necessary for coupled multi-physics simulations. Furthermore, existing frameworks that support coupled simulations have been developed for fragmented programming models such as message passing, and are conceptually mismatched with the shared memory address space abstraction in the PGAS programming model. This paper explores how multi-physics coupled simulations can be supported within the PGAS programming framework. Specifically, we present the design and implementation of the XpressSpace programming system, which enables efficient and productive development of coupled simulations across multiple independent PGAS Unified Parallel C (UPC) executables. XpressSpace provides a global-view style programming interface that is consistent with the memory model in UPC, and provides an efficient runtime system that can dynamically capture the data decomposition of global-view arrays and enable fast exchange of parallel data structures between coupled codes. In addition, XpressSpace provides the flexibility to define the coupling process in a specification file that is independent of the program source codes. We evaluate the performance and scalability of the XpressSpace prototype implementation using different coupling patterns extracted from real-world multi-physics simulation scenarios, on the Jaguar Cray XT5 system of Oak Ridge National Laboratory.
Cloud
13 : EZTrace: A Generic Framework for Performance
Analysis
Modern supercomputers with multi-core nodes enhanced by accelerators, as well as hybrid programming models, introduce more complexity in modern applications. Efficiently exploiting all the resources requires a complex analysis of the performance of applications in order to detect time-consuming sections. We present eztrace, a generic trace generation framework that aims at providing a simple way to analyze applications. eztrace is based on plugins that allow it to trace different programming models such as MPI, pthread or OpenMP as well as user-defined libraries or applications. eztrace uses two steps: one to collect the basic information during execution and one for post-mortem analysis. This permits tracing the execution of applications with low overhead while allowing the analysis to be refined after the execution. We also present a script language for eztrace that gives the user the opportunity to easily define the functions to instrument without modifying the source code of the application.
Cloud
14 : Implementing Trust in Cloud Infrastructures
Today's cloud computing infrastructures usually require customers who transfer data into the cloud to trust the providers of the cloud infrastructure. Not every customer is willing to grant this trust without justification. It should be possible to detect that at least the configuration of the cloud infrastructure -- as provided in the form of a hypervisor and administrative domain software -- has not been changed without the customer's consent. We present a system that enables periodic and necessity-driven integrity measurements and remote attestations of vital parts of cloud computing infrastructures. Building on the analysis of several relevant attack scenarios, our system is implemented on top of the Xen Cloud Platform and makes use of trusted computing technology to provide security guarantees. We evaluate both security and performance of this system. We show how our system attests the integrity of a cloud infrastructure and detects all changes performed by system administrators in a typical software configuration, even in the presence of a simulated denial-of-service attack.
Cloud
15 : Improving Utilization of Infrastructure Clouds
A key advantage of infrastructure-as-a-service (IaaS) clouds is providing users on-demand access to resources. To provide on-demand access, however, cloud providers must either significantly overprovision their infrastructure (and pay a high price for operating resources with low utilization) or reject a large proportion of user requests (in which case the access is no longer on-demand). At the same time, not all users require truly on-demand access to resources. Many applications and workflows are designed for recoverable systems where interruptions in service are expected. For instance, many scientists utilize high-throughput computing (HTC)-enabled resources, such as Condor, where jobs are dispatched to available resources and terminated when the resource is no longer available. We propose a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs). For demonstration and experimental evaluation, we extend the Nimbus cloud computing toolkit to deploy backfill VMs on idle cloud nodes for processing an HTC workload. Initial tests show an increase in IaaS cloud utilization from 37.5% to 100% during a portion of the evaluation trace but only 6.39% overhead cost for processing the HTC workload. We demonstrate that a shared infrastructure between IaaS cloud providers and an HTC job management system can be highly beneficial to both the IaaS cloud provider and HTC users by increasing the utilization of the cloud infrastructure (thereby decreasing the overall cost) and contributing cycles that would otherwise be idle to processing HTC jobs.
Cloud
16 : Inferring Network Topologies in Infrastructure as a
Service Cloud
Infrastructure as a Service (IaaS) clouds are gaining increasing popularity as a platform for distributed computations. The virtualization layers of those clouds offer new possibilities for rapid resource provisioning, but also hide aspects of the underlying IT infrastructure which have often been exploited in classic cluster environments. One of those hidden aspects is the network topology, i.e. the way the rented virtual machines are physically interconnected inside the cloud. We propose an approach to infer the network topology connecting a set of virtual machines in IaaS clouds and exploit it for data-intensive distributed applications. Our inference approach relies on delay-based end-to-end measurements and can be combined with traditional IP-level topology information, if available. We evaluate the inference accuracy using the popular hypervisors KVM and Xen and highlight possible performance gains for distributed applications.
Cloud
17 : MPI-IO/Gfarm: An Optimized
Implementation of MPI-IO for the Gfarm File System
This paper proposes the design and implementation of MPI-IO for the Gfarm file system, called MPI-IO/Gfarm. The Gfarm file system is a global file system that federates the local storage of compute nodes among several clusters. It has a scale-out architecture designed to support distributed data-intensive computing. However, the Gfarm file system does not achieve scalable performance in the case of parallel writes to a single file, a typical file operation in MPI-IO. This paper proposes an optimization technique to improve the parallel write performance to a single file. In the evaluation, MPI-IO/Gfarm achieves scalable parallel I/O performance.
Cloud
18 : Multiple Services Throughput Optimization in a
Hierarchical Middleware
Accessing the power of distributed resources can nowadays easily be done using middleware based on a client/server approach. Several architectures exist for such middleware. The most scalable ones rely on a hierarchical design. Determining the best shape for the hierarchy, the one giving the best throughput of services, is not an easy task. We first propose a computation and communication model for such hierarchical middleware. Our model takes into account the deployment of several services in the hierarchy. Then, based on this model, we propose algorithms for automatically constructing a hierarchy on two kinds of heterogeneous platforms: communication homogeneous/computation heterogeneous platforms, and fully heterogeneous platforms. The proposed algorithms aim at offering the users the best obtained-to-requested throughput ratio, while providing fairness on this ratio for the different kinds of services, and using as few resources as possible for the hierarchy. For each kind of platform, we compare our model with experimental results on a real middleware called DIET (Distributed Interactive Engineering Toolbox).
Cloud
19 : Network-Friendly One-Sided Communication through Multinode Cooperation on Petascale
Cray XT5 Systems
One-sided communication is important to enable asynchronous communication and data movement for Global Address Space (GAS) programming models. Such communication is typically realized through direct messages between initiator and target processes. For petascale systems with 10,000s of nodes and 100,000s of cores, these direct messages require dedicated communication buffers and/or channels, which can lead to significant scalability challenges for GAS programming models. In this paper, we describe a network-friendly communication model, multinode cooperation, to enable indirect one-sided communication. Compute nodes work together to handle one-sided requests through (1) request forwarding, in which one node can intercept a request and forward it to a target node, and (2) request aggregation, in which one node can aggregate many requests to a target node. We have implemented multinode cooperation for a popular GAS runtime library, Aggregate Remote Memory Copy Interface (ARMCI). Our experimental results on a large-scale Cray XT5 system demonstrate that multinode cooperation is able to greatly increase memory scalability by reducing the communication buffers required on each node. In addition, multinode cooperation improves the resiliency of the GAS runtime system to network contention. Furthermore, multinode cooperation can benefit the performance of scientific applications. In one case, it reduces the total execution time of an NWChem application by 52%.
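The request aggregation idea can be sketched in a few lines of Python: an intermediate node buffers requests per target and forwards them as one combined message. This is an illustrative model only and does not reflect the ARMCI implementation; the class `AggregatingForwarder`, its `batch_size` threshold and the `send` callback are hypothetical names introduced for this sketch.

```python
from collections import defaultdict

class AggregatingForwarder:
    """Hypothetical intermediate node that batches requests per target node."""

    def __init__(self, send, batch_size=4):
        self.send = send                  # callback: send(target, list_of_requests)
        self.batch_size = batch_size      # flush threshold (assumed policy)
        self.pending = defaultdict(list)  # target -> buffered requests

    def submit(self, target, request):
        """Buffer a request and forward the batch once it is full."""
        queue = self.pending[target]
        queue.append(request)
        if len(queue) >= self.batch_size:
            self.flush(target)

    def flush(self, target):
        """Forward all buffered requests for one target as a single message."""
        batch = self.pending.pop(target, [])
        if batch:
            self.send(target, batch)

    def flush_all(self):
        """Forward whatever is still buffered, for every target."""
        for target in list(self.pending):
            self.flush(target)

sent = []
forwarder = AggregatingForwarder(lambda target, batch: sent.append((target, batch)))
for request_id in range(5):
    forwarder.submit(7, request_id)
forwarder.flush_all()
```

In this model the target node receives two messages instead of five, which is the kind of reduction in per-peer traffic and buffering that the abstract attributes to aggregation.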
Cloud
20 : On the Performance Variability of Production Cloud
Services
Cloud computing is an emerging infrastructure paradigm that promises to eliminate the need for companies to maintain expensive computing hardware. Through the use of virtualization and resource time-sharing, clouds address with a single set of physical resources a large user base with diverse needs. Thus, clouds have the potential to provide their owners the benefits of an economy of scale and, at the same time, become an alternative for both the industry and the scientific community to self-owned clusters, grids, and parallel production environments. For this potential to become reality, the first generation of commercial clouds needs to be proven dependable. In this work we analyze the dependability of cloud services. Towards this end, we analyze long-term performance traces from Amazon Web Services and Google App Engine, currently two of the largest commercial clouds in production. We find that the performance of about half of the cloud services we investigate exhibits yearly and daily patterns, but also that most services have periods of especially stable performance. Last, through trace-based simulation we assess the impact of the variability observed for the studied cloud services on three large-scale applications: job execution in scientific computing, virtual goods trading in social networks, and state management in social gaming. We show that the impact of performance variability depends on the application, and give evidence that performance variability can be an important factor in cloud provider selection.
Cloud
21 : On the Relation between Congestion Control, Switch
Arbitration and Fairness
In lossless interconnection networks such as InfiniBand, congestion control (CC) can be an effective mechanism to achieve high performance and good utilization of network resources. The InfiniBand standard describes CC functionality for detecting and resolving congestion, but the design decisions on how to implement this functionality are left to the hardware designer. One must be cautious when making these design decisions not to introduce fairness problems, as our study shows. In this paper we study the relationship between congestion control, switch arbitration, and fairness. Specifically, we look at fairness among different traffic flows arriving at a hot spot switch on different input ports, as CC is turned on. In addition we study the fairness among traffic flows at a switch where some flows are exclusive users of their input ports while other flows are sharing an input port (the parking lot problem). Our results show that the implementation of congestion control in a switch is vulnerable to unfairness if care is not taken. In detail, we found that a threshold hysteresis of more than one MTU is needed to resolve arbitration unfairness. Furthermore, to fully solve the parking lot problem, proper configuration of the CC parameters is required.
Cloud
22 : Open Social Based Collaborative Science Gateways
In data-driven science projects, researchers distributed across different institutions often wish to easily team up to share data and computing resources and address challenging scientific problems. Typical VO-based authorization schemes are not suitable for such user-organized scientific collaboration. Using the emerging OAuth protocol, we introduce a novel group authorization scheme to support ad-hoc team formation and user-controlled resource sharing. Integrating this group authorization scheme, we define an Open Social based scientific collaboration framework and develop a science gateway prototype named Open Life Science Gateway (OLSGW) to verify and refine the framework. Our experience with the development of the OLSGW shows that the OAuth 2.0 based group authorization scheme is a very promising approach to resource sharing in Cloud environments, and that the Open Social based framework can help science gateway developers create domain-specific collaborative applications in a very flexible way.
Cloud
23 : Optimized Management of Power and Performance for
Virtualized Heterogeneous Server Clusters
This paper proposes and evaluates an approach for power and performance management in virtualized server clusters. The major goal of our approach is to reduce power consumption in the cluster while meeting performance requirements. The contributions of this paper are: (1) a simple but effective way of modeling power consumption and capacity of servers even under heterogeneous and changing workloads, and (2) an optimization strategy based on a mixed integer programming model for achieving improvements in power efficiency while providing performance guarantees in the virtualized cluster. In the optimization model, we address application workload balancing and the often-ignored switching costs incurred by frequently turning servers on/off and relocating VMs. We show the effectiveness of the approach applied to a server cluster test bed. Our experiments show that our approach conserves about 50% of the energy required by a system designed for a peak workload scenario, with little impact on the applications' performance goals. Also, by using prediction in our optimization strategy, further QoS improvement was achieved.
Cloud
24 : PAC-PLRU: A Cache Replacement Policy to Salvage
Discarded Predictions from Hardware Prefetchers
The cache replacement policy plays an important role in guaranteeing the availability of cache blocks, reducing miss rates, and improving applications' overall performance. However, recent research efforts on improving replacement policies require either significant additional hardware or major modifications to the organization of the existing cache. In this study, we propose the PAC-PLRU cache replacement policy. PAC-PLRU not only utilizes but also judiciously salvages the prediction information discarded from a widely adopted stride prefetcher. The main idea behind PAC-PLRU is utilizing the prediction results generated by the existing stride prefetcher and preventing these predicted cache blocks from being replaced in the near future. Experimental results show that leveraging PAC-PLRU with a stride prefetcher reduces the average L2 cache miss rate by 91% over a baseline system with only the PLRU policy, and by 22% over a system using PLRU with an unconnected stride prefetcher. Most importantly, PAC-PLRU only requires minor modifications to the existing cache architecture to get these benefits. The proposed PAC-PLRU policy is promising in fostering the connection between prefetching and replacement policies, and may have a lasting impact on improving overall cache performance.
Cloud
25 : Performance under Failures of MapReduce
Applications
The MapReduce programming paradigm has been gaining popularity in recent years due to its support for easy programming, data distribution, and fault tolerance. Failure is an unwanted but inevitable fact that all large-scale parallel computing systems have to face. MapReduce introduces a novel data replication and task reexecution strategy for fault tolerance. This study aims to provide a better understanding of such fault tolerance mechanisms. In particular, we build a stochastic performance model to quantify the impact of failures on MapReduce applications and to investigate its effectiveness under different computing environments. Simulations have also been carried out to verify the accuracy of the proposed model. Our results show that data replication is an effective approach even when the failure rate is high, and the task migration mechanism of MapReduce works well in balancing the reliability difference among individual nodes. This work provides a theoretical foundation for optimizing large-scale MapReduce applications, especially when fault tolerance is the concern.
Cloud
26 : Small Discrete Fourier Transforms on GPUs
Efficient implementations of the Discrete Fourier Transform (DFT) for GPUs provide good performance with large data sizes, but are not competitive with CPU code for small data sizes. On the other hand, several applications perform multiple DFTs on small data sizes. In fact, even algorithms for large data sizes use a divide-and-conquer approach, where eventually small DFTs need to be performed. We discuss our DFT implementation, which is efficient for multiple small DFTs. One feature of our implementation is the use of the asymptotically slow matrix multiplication approach for small data sizes, which improves performance on the GPU due to its regular memory access and computational patterns. We combine this algorithm with the mixed radix algorithm for 1-D, 2-D, and 3-D complex DFTs. We also demonstrate the effect of different optimization techniques. When GPUs are used to accelerate a component of an application running on the host, it is important that decisions taken to optimize the GPU performance not affect the performance of the rest of the application on the host. One feature of our implementation is that we use a data layout that is not optimal for the GPU so that the overall effect on the application is better. Our implementation performs up to two orders of magnitude faster than cuFFT on an NVIDIA GeForce 9800 GTX GPU and up to one to two orders of magnitude faster than FFTW on a CPU for multiple small DFTs. Furthermore, we show that our implementation can accelerate the performance of a Quantum Monte Carlo application for which cuFFT is not effective.
The primary contributions of this work lie in demonstrating the utility of the matrix multiplication approach and also in providing an implementation that is efficient for small DFTs when a GPU is used to accelerate an application running on the host.
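The matrix multiplication approach named in this abstract can be illustrated with a minimal CPU sketch. It is not the authors' GPU implementation: the function names dft_matrix and batched_dft are hypothetical, and the code only shows the idea that a small DFT of size n is a product with a dense n x n matrix, costing O(n^2) per transform but with a completely regular access pattern.

```python
import cmath


def dft_matrix(n):
    """Build the dense n x n DFT matrix W[j][k] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]


def batched_dft(signals):
    """Apply the DFT to every equal-length signal via a matrix product.

    Hypothetical helper: each output row is W @ x, so the work per signal
    is O(n^2), but every signal reuses the same matrix W.
    """
    n = len(signals[0])
    w = dft_matrix(n)
    return [[sum(w[j][k] * x[k] for k in range(n)) for j in range(n)]
            for x in signals]


if __name__ == "__main__":
    # Two small DFTs of size 4 processed in one batch.
    for row in batched_dft([[1, 0, 0, 0], [1, 1, 1, 1]]):
        print([complex(round(v.real, 6), round(v.imag, 6)) for v in row])
```

Because the same matrix is reused for every signal in the batch, many small transforms reduce to one dense matrix-matrix product, which is the kind of regular computation the abstract says suits the GPU.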
Cloud
27 : Sophia: Local Trust for Securing Routing in DHTs
Distributed Hash Tables (DHTs) have been used as a common building block in many distributed applications, including Cloud and Grid. However, there are still important security vulnerabilities that hinder their adoption in today's large-scale computing platforms. For instance, routing vulnerabilities have been a subject of intensive research, but existing solutions rely on redundancy in lieu of improving the quality of routing paths. In this paper, we present Sophia, a novel generic security technique which combines iterative routing with local trust to fortify routing in DHTs. Sophia strictly benefits from first-hand observations about the success/failure of a node's own lookups to improve forwarding paths. Moreover, unlike redundant routing, Sophia dynamically protects routing without introducing additional network overhead. To the best of our knowledge, this is the first work which exploits a local trust system to fortify routing in DHTs. We compared the performance of Sophia with redundant routing in the Kademlia DHT. We obtained significant improvements regarding routing resilience, self-adjustment and network traffic reduction.
Cloud
28 : Supporting Federated Multi-authority Security Models
The JISC-funded Shintau project has produced an extension to the Shibboleth profile which allows a user to link information from more than one IdP together utilising a custom Linking Service (LS). This paper describes both the application and independent evaluation of this software by the National e-Science Centre (NeSC) at the University of Glasgow within the context of the ESRC-funded Data Management through e-Social Science (DAMES) project.
Cloud
29 : The Grid Observatory
The goal of the Grid Observatory project (GO) is to contribute to an experimental theory of large grid systems by integrating the collection of data on the behaviour of the flagship European Grid Infrastructure (EGI) and its users, the development of models, and an ontology for the domain knowledge. The GO gives access to a database of grid usage traces available to the wider computer science community without the need for grid credentials. The paper presents the architecture of the digital curation process enacted by the GO and examples of the exploitation of these traces.
Cloud
30 : A Performance Goal Oriented Processor Allocation
Technique for Centralized Heterogeneous Multi-cluster Environments
This paper proposes a processor allocation technique named temporal look-ahead processor allocation (TLPA) that makes allocation decisions by evaluating the allocation effects on subsequent jobs in the waiting queue. TLPA has two strengths. First, it takes multiple performance factors into account when making allocation decisions. Second, it can be used to optimize different performance metrics. To evaluate the performance of TLPA, we compare TLPA with best-fit and fastest-first algorithms. Simulation results show that TLPA achieves up to a 32.75% performance improvement over conventional processor allocation algorithms in terms of average turnaround time in various system configurations.
Cloud
31 : Addressing Resource Fragmentation in Grids through
Network-Aware Meta-scheduling in Advance
Grids are made of geographically dispersed heterogeneous computing resources, where providing Quality of Service (QoS) is a challenging task. One way of enhancing the QoS perceived by users is by performing scheduling of jobs in advance, since reservations of resources are not always possible. This way, it becomes more likely that the appropriate resources are available to run the job when needed. One drawback of this scenario is that fragmentation appears as a well-known effect in job allocations into resources and becomes the cause of poor resource utilization. So, a new technique has been developed to tackle fragmentation problems, which consists of rescheduling already scheduled tasks. To this end, some heuristics are implemented to calculate the intervals to be replanned and to select the jobs involved in the process. Moreover, another heuristic is implemented to put rescheduled jobs as close together as possible to minimize the fragmentation. This technique has been tested using a real test bed.
Cloud
32 : APP: Minimizing Interference Using Aggressive
Pipelined Prefetching in Multi-level Buffer Caches
As services become more complex with multiple interactions, and storage servers are shared by multiple services, the different I/O streams arising from these multiple services compete for disk attention. Aggressive Pipelined Prefetching (APP) enabled storage clients are designed to manage the buffer cache and I/O streams to minimize the disk I/O-interference arising from competing streams. Due to the large number of streams serviced by a storage server, most of the disk time is spent seeking, leading to degradation in response times. The goal of APP is to decrease application execution time by increasing the throughput of individual I/O streams and utilizing idle capacity on remote nodes along with idle network times thus effectively avoiding alternating bursts of activity followed by periods of inactivity. APP significantly increases overall I/O throughput and decreases overall messaging overhead between servers. In APP, the intelligence is embedded in the clients and they automatically infer parameters in order to achieve the maximum throughput. APP clients make use of aggressive prefetching and data offloading to remote buffer caches in multi-level buffer cache hierarchies in an effort to minimize disk interference and tranquilize the effects of aggressive prefetching. We used an extremely I/O-intensive Radix-k application employed in studies on the scalability of parallel image composition and particle tracing developed at the Argonne National Laboratory with data sets of up to 128 GB and implemented our scheme on a 16-node Linux cluster. We observed that the execution time of the application decreased by 68% on average when using our scheme.
Cloud
33 : ASDF: An Autonomous and Scalable Distributed File
System
The demand for huge storage space in data-intensive applications and high-performance scientific computing continues to grow. Integrating massive distributed storage resources to provide huge storage space is an important and challenging issue in Cloud and Grid computing. In this paper, we propose a distributed file system, called ASDF, to meet the demands of not only data-intensive applications but also end users, developers and administrators. While sharing many of the same goals as previous distributed file systems, such as scalability, reliability, and performance, it is also designed with an emphasis on compatibility, extensibility and autonomy. With these design goals in mind, we address several issues and present our design by adopting peer-to-peer technology, replication, multi-source data transfer, metadata caching and service-oriented architecture. The experimental results show that the proposed distributed file system meets our design goals and will be useful in Cloud and Grid computing.
Cloud
34 : Assertion Based Parallel Debugging
Programming languages have advanced tremendously over the years, but program debuggers have hardly changed. Sequential debuggers do little more than allow a user to control the flow of a program and examine its state. Parallel ones support the same operations on multiple processes, which are adequate with a small number of processors, but become unwieldy and ineffective on very large machines. Typical scientific codes have enormous multi-dimensional data structures and it is impractical to expect a user to view the data using traditional display techniques. In this paper we discuss the use of debug-time assertions, and show that these can be used to debug parallel programs. The techniques reduce the debugging complexity because they reason about the state of large arrays without requiring the user to know the expected value of every element. Assertions can be expensive to evaluate, but their performance can be improved by running them in parallel. We demonstrate the system with a case study finding errors in a parallel version of the Shallow Water Equations, and evaluate the performance of the tool on a 4,096-core Cray XE6.
Cloud
36 : Autonomic SLA-Driven Provisioning for Cloud
Applications
Significant achievements have been made for automated allocation of cloud resources. However, the performance of applications may be poor in peak load periods, unless their cloud resources are dynamically adjusted. Moreover, although cloud resources dedicated to different applications are virtually isolated, performance fluctuations do occur because of resource sharing, and software or hardware failures (e.g. unstable virtual machines, power outages, etc.). In this paper, we propose a decentralized economic approach for dynamically adapting the cloud resources of various applications, so as to statistically meet their SLA performance and availability goals in the presence of varying loads or failures. According to our approach, the dynamic economic fitness of a Web service determines whether it is replicated or migrated to another server, or deleted. The economic fitness of a Web service depends on its individual performance constraints, its load, and the utilization of the resources where it resides. Cascading performance objectives are dynamically calculated for individual tasks in the application workflow according to the user requirements. By fully implementing our framework, we experimentally proved that our adaptive approach statistically meets the performance objectives under peak load periods or failures, as opposed to static resource settings.
Cloud
37 : BAR: An Efficient Data Locality Driven Task Scheduling
Algorithm for Cloud Computing
Large-scale data processing has become increasingly common in recent years in cloud computing systems like MapReduce, Hadoop, and Dryad. In these systems, files are split into many small blocks and all blocks are replicated over several servers. To process files efficiently, each job is divided into many tasks and each task is allocated to a server to deal with a file block. Because network bandwidth is a scarce resource in these systems, enhancing task data locality (placing tasks on servers that contain their input blocks) is crucial for the job completion time. Although there have been many approaches to improving data locality, most of them either are greedy and ignore global optimization, or suffer from high computational complexity. To address these problems, we propose a heuristic task scheduling algorithm called Balance-Reduce (BAR), in which an initial task allocation is produced first, and the job completion time is then reduced gradually by tuning the initial task allocation. By taking a global view, BAR can adjust data locality dynamically according to network state and cluster workload. The simulation results show that BAR is able to deal with large problem instances in a few seconds and outperforms previous related algorithms in terms of the job completion time.
Cloud
38 : Characterizing the Performance of Parallel
Applications on Multi-socket Virtual Machines
In this paper we characterize the behavior with respect to memory locality management of scientific computing applications running in virtualized environments. NUMA locality on current solutions (KVM and Xen) is enforced by pinning virtual machines to CPUs and providing NUMA-aware allocation in hypervisors. Our analysis shows that due to two-level memory management and lack of integration with page reclamation mechanisms, applications running on warm VMs suffer from a "leakage" of page locality. Our results using MPI, UPC and OpenMP implementations of the NAS Parallel Benchmarks, running on Intel and AMD NUMA systems, indicate that applications observe an overall average performance degradation of 55% when compared to native. Runs on "cold" VMs suffer an average performance degradation of 27%, while subsequent runs are roughly 30% slower than the cold runs. We quantify the impact of locality improvement techniques designed for full virtualization environments: hypervisor-level page remapping and partitioning the NUMA domains between multiple virtual machines. Our analysis shows that hypervisor-only schemes have little or no potential for performance improvement. When the programming model allows it, system partitioning with proper VM and runtime support is able to reproduce native performance: in a partitioned system with one virtual machine per socket the average workload performance is 5% better than native.
Cloud
39 : Classification and Composition of QoS
Attributes in Distributed, Heterogeneous Systems
In large-scale distributed systems the selection of services and data sources to respond to a given request is a crucial task. Non-functional or Quality of Service (QoS) attributes need to be considered when there are several candidate services with identical functionality. Before applying any service selection optimization strategy, the system has to be analyzed in terms of QoS metrics, comparable to the statistics needed by a database query optimizer. This paper presents a classification approach for QoS attributes of system components, from which aggregation functions for composite services are derived. The applicability and usefulness of the approach are shown in a distributed system from a High-Energy Physics experiment posing a complex service selection challenge.
Cloud
40 : Cloud computing in Aircraft Data Network
The introduction of data networks within an aircraft has created several service opportunities for the air carriers. Using the available Internet connectivity, the carriers could offer services like Video-on-Demand (VoD), Voice-over-IP (VoIP), and gaming-on-demand within the aircraft. One of the major roadblocks in implementing any of these services is the additional hardware and software requirements. Each service requires dedicated hardware resources to run appropriate software components. It is not possible to accommodate every hardware component within the aircraft due to space, power, and ventilation restrictions. Also, it is not economically viable to install and maintain hardware components for every aircraft. One solution is to use cloud computing. Cloud computing is a recent innovation that is helping the computing industry in distributed computing. Cloud computing allows organizations to consolidate several hardware resources into one physical device. The cloud computing concept helps organizations reduce the overall power consumption and maintenance costs. The cloud computing concept could be extended to the Aircraft Data Network environment, with every aircraft subscribing to the cloud resources to run its non-mission-critical applications. In this paper, the authors explore the possibility of using cloud services for Aircraft Data Networks. The authors evaluate the performance issues involved with aircraft mobility and dynamic resource transfer between servers when the aircraft's point-of-attachment changes. The authors predict that using cloud computing concepts would encourage many carriers to offer new services within the aircraft.
Cloud
41 : Dealing with Grid-Computing Authorization Using
Identity-Based Certificateless Proxy Signature
In this paper, we propose a new Identity-Based Certificateless Proxy Signature scheme for the grid environment, in order to enable attribute-based authorization, fine-grained delegation and enhanced delegation chain establishment and validation, all without relying on any kind of PKI certificates or proxy certificates. We show that our scheme is correct and secure. We also give an evaluation of the computational and communication overhead of the proposed scheme. Simulations show satisfactory results.
Cloud
42 : Enabling Multi-physics Coupled Simulations within the
PGAS Programming Framework
Complex coupled multi-physics simulations are playing increasingly important roles in scientific and engineering applications such as fusion plasma and climate modeling. At the same time, extreme scales, high levels of concurrency and the advent of multicore and many-core technologies are making the high-end parallel computing systems on which these simulations run hard to program. While the Partitioned Global Address Space (PGAS) languages attempt to address the problem, the PGAS model does not easily support the coupling of multiple application codes, which is necessary for coupled multi-physics simulations. Furthermore, existing frameworks that support coupled simulations have been developed for fragmented programming models such as message passing, and are conceptually mismatched with the shared memory address space abstraction in the PGAS programming model. This paper explores how multi-physics coupled simulations can be supported within the PGAS programming framework. Specifically, in this paper, we present the design and implementation of the XpressSpace programming system, which enables efficient and productive development of coupled simulations across multiple independent PGAS Unified Parallel C (UPC) executables. XpressSpace provides a global-view style programming interface that is consistent with the memory model in UPC, and provides an efficient runtime system that can dynamically capture the data decomposition of global-view arrays and enable fast exchange of parallel data structures between coupled codes. In addition, XpressSpace provides the flexibility to define the coupling process in a specification file that is independent of the program source code. We evaluate the performance and scalability of the XpressSpace prototype implementation using different coupling patterns extracted from real-world multi-physics simulation scenarios, on the Jaguar Cray XT5 system of Oak Ridge National Laboratory.
Cloud
43 : EZTrace: A Generic Framework for Performance
Analysis
Modern supercomputers with multi-core nodes enhanced by accelerators, as well as hybrid programming models, introduce more complexity in modern applications. Efficiently exploiting all the resources requires a complex analysis of the performance of applications in order to detect time-consuming sections. We present eztrace, a generic trace generation framework that aims at providing a simple way to analyze applications. eztrace is based on plugins that allow it to trace different programming models such as MPI, pthread or OpenMP as well as user-defined libraries or applications. eztrace uses two steps: one to collect the basic information during execution and one for post-mortem analysis. This permits tracing the execution of applications with low overhead while allowing the analysis to be refined after the execution. We also present a script language for eztrace that gives the user the opportunity to easily define the functions to instrument without modifying the source code of the application.
Cloud
44 : Implementing Trust in Cloud Infrastructures
Today's cloud computing infrastructures usually require customers who transfer data into the cloud to trust the providers of the cloud infrastructure. Not every customer is willing to grant this trust without justification. It should be possible to detect that at least the configuration of the cloud infrastructure -- as provided in the form of a hypervisor and administrative domain software -- has not been changed without the customer's consent. We present a system that enables periodic and necessity-driven integrity measurements and remote attestations of vital parts of cloud computing infrastructures. Building on the analysis of several relevant attack scenarios, our system is implemented on top of the Xen Cloud Platform and makes use of trusted computing technology to provide security guarantees. We evaluate both security and performance of this system. We show how our system attests the integrity of a cloud infrastructure and detects all changes performed by system administrators in a typical software configuration, even in the presence of a simulated denial-of-service attack.
Cloud
45 : Improving Utilization of Infrastructure
Clouds
A key advantage of infrastructure-as-a-service (IaaS) clouds is providing users on-demand access to resources. To provide on-demand access, however, cloud providers must either significantly overprovision their infrastructure (and pay a high price for operating resources with low utilization) or reject a large proportion of user requests (in which case the access is no longer on-demand). At the same time, not all users require truly on-demand access to resources. Many applications and workflows are designed for recoverable systems where interruptions in service are expected. For instance, many scientists utilize high-throughput computing (HTC)-enabled resources, such as Condor, where jobs are dispatched to available resources and terminated when the resource is no longer available. We propose a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs). For demonstration and experimental evaluation, we extend the Nimbus cloud computing toolkit to deploy backfill VMs on idle cloud nodes for processing an HTC workload. Initial tests show an increase in IaaS cloud utilization from 37.5% to 100% during a portion of the evaluation trace but only 6.39% overhead cost for processing the HTC workload. We demonstrate that a shared infrastructure between IaaS cloud providers and an HTC job management system can be highly beneficial to both the IaaS cloud provider and HTC users by increasing the utilization of the cloud infrastructure (thereby decreasing the overall cost) and contributing cycles that would otherwise be idle to processing HTC jobs.
Cloud
46 : Inferring Network Topologies in Infrastructure as a
Service Cloud
Infrastructure as a Service (IaaS) clouds are gaining increasing popularity as a platform for distributed computations. The virtualization layers of those clouds offer new possibilities for rapid resource provisioning, but also hide aspects of the underlying IT infrastructure which have often been exploited in classic cluster environments. One of those hidden aspects is the network topology, i.e. the way the rented virtual machines are physically interconnected inside the cloud. We propose an approach to infer the network topology connecting a set of virtual machines in IaaS clouds and exploit it for data-intensive distributed applications. Our inference approach relies on delay-based end-to-end measurements and can be combined with traditional IP-level topology information, if available. We evaluate the inference accuracy using the popular hypervisors KVM and Xen and highlight possible performance gains for distributed applications.
Cloud
47 : MPI-IO/Gfarm: An Optimized
Implementation of MPI-IO for the Gfarm File System
This paper proposes the design and implementation of MPI-IO for the Gfarm file system, called MPI-IO/Gfarm. The Gfarm file system is a global file system that federates the local storage of compute nodes among several clusters. It has a scale-out architecture designed to support distributed data-intensive computing. However, the Gfarm file system does not achieve scalable performance in the case of parallel writes to a single file, a typical file operation in MPI-IO. This paper proposes an optimization technique to improve the parallel write performance to a single file. In the evaluation, MPI-IO/Gfarm achieves scalable parallel I/O performance.
Cloud
48 : Multiple Services Throughput Optimization in a
Hierarchical Middleware
Accessing the power of distributed resources can nowadays easily be done using a middleware based on a client/server approach. Several architectures exist for such middleware. The most scalable ones rely on a hierarchical design. Determining the best shape for the hierarchy, the one giving the best throughput of services, is not an easy task. We first propose a computation and communication model for such hierarchical middleware. Our model takes into account the deployment of several services in the hierarchy. Then, based on this model, we propose algorithms for automatically constructing a hierarchy on two kinds of heterogeneous platforms: communication homogeneous/computation heterogeneous platforms, and fully heterogeneous platforms. The proposed algorithms aim at offering the users the best obtained-to-requested throughput ratio, while providing fairness on this ratio for the different kinds of services, and using as few resources as possible for the hierarchy. For each kind of platform, we compare our model with experimental results on a real middleware called DIET (Distributed Interactive Engineering Toolbox).
Cloud
49 : Network-Friendly One-Sided Communication through Multinode Cooperation on Petascale
Cray XT5 Systems
One-sided communication is important to enable asynchronous communication and data movement for Global Address Space (GAS) programming models. Such communication is typically realized through direct messages between initiator and target processes. For petascale systems with 10,000s of nodes and 100,000s of cores, these direct messages require dedicated communication buffers and/or channels, which can lead to significant scalability challenges for GAS programming models. In this paper, we describe a network-friendly communication model, multinode cooperation, to enable indirect one-sided communication. Compute nodes work together to handle one-sided requests through (1) request forwarding, in which one node can intercept a request and forward it to a target node, and (2) request aggregation, in which one node can aggregate many requests to a target node. We have implemented multinode cooperation for a popular GAS runtime library, Aggregate Remote Memory Copy Interface (ARMCI). Our experimental results on a large-scale Cray XT5 system demonstrate that multinode cooperation is able to greatly increase memory scalability by reducing the communication buffers required on each node. In addition, multinode cooperation improves the resiliency of the GAS runtime system to network contention. Furthermore, multinode cooperation can benefit the performance of scientific applications. In one case, it reduces the total execution time of an NWChem application by 52%.
Cloud
50 : On the Performance Variability of Production Cloud
Services
Cloud computing is an emerging infrastructure paradigm that promises to eliminate the need for companies to maintain expensive computing hardware. Through the use of virtualization and resource time-sharing, clouds address with a single set of physical resources a large user base with diverse needs. Thus, clouds have the potential to provide their owners the benefits of an economy of scale and, at the same time, become an alternative for both the industry and the scientific community to self-owned clusters, grids, and parallel production environments. For this potential to become reality, the first generation of commercial clouds needs to be proven to be dependable. In this work we analyze the dependability of cloud services. Towards this end, we analyze long-term performance traces from Amazon Web Services and Google App Engine, currently two of the largest commercial clouds in production. We find that the performance of about half of the cloud services we investigate exhibits yearly and daily patterns, but also that most services have periods of especially stable performance. Last, through trace-based simulation we assess the impact of the variability observed for the studied cloud services on three large-scale applications: job execution in scientific computing, virtual goods trading in social networks, and state management in social gaming. We show that the impact of performance variability depends on the application, and give evidence that performance variability can be an important factor in cloud provider selection.
Cloud
51 : On the Relation between Congestion Control, Switch
Arbitration and Fairness
In lossless interconnection networks such as InfiniBand, congestion control (CC) can be an effective mechanism to achieve high performance and good utilization of network resources. The InfiniBand standard describes CC functionality for detecting and resolving congestion, but the design decisions on how to implement this functionality are left to the hardware designer. One must be cautious when making these design decisions not to introduce fairness problems, as our study shows. In this paper we study the relationship between congestion control, switch arbitration, and fairness. Specifically, we look at fairness among different traffic flows arriving at a hot spot switch on different input ports, as CC is turned on. In addition, we study the fairness among traffic flows at a switch where some flows are exclusive users of their input ports while other flows are sharing an input port (the parking lot problem). Our results show that the implementation of congestion control in a switch is vulnerable to unfairness if care is not taken. In detail, we found that a threshold hysteresis of more than one MTU is needed to resolve arbitration unfairness. Furthermore, to fully solve the parking lot problem, proper configuration of the CC parameters is required.
Cloud
52 : Open Social Based Collaborative Science Gateways
In data-driven science projects, researchers distributed in different institutions often wish to easily team up for data and computing resource sharing to address challenging scientific problems. Typical VO based authorization schemes are not suitable for such user organized scientific collaboration. Using the emerging OAuth protocol, we introduce a novel group authorization scheme to support ad-hoc team formation and user controlled resource sharing. Integrating this group authorization scheme, we define an Open Social based scientific collaboration framework and develop a science gateway prototype named Open Life Science Gateway (OLSGW) to verify and refine the framework. Our experience with development of the OLSGW shows that the OAuth 2.0 based group authorization scheme is a very promising approach to resource sharing in Cloud environments, and that the Open Social based framework can help science gateway developers create domain-specific collaborative applications in a very flexible way.
Cloud
53 : Optimized Management of Power and Performance for Virtualized Heterogeneous Server Clusters
This paper proposes and evaluates an approach for power and performance management in virtualized server clusters. The major goal of our approach is to reduce power consumption in the cluster while meeting performance requirements. The contributions of this paper are: (1) a simple but effective way of modeling power consumption and capacity of servers even under heterogeneous and changing workloads, and (2) an optimization strategy based on a mixed integer programming model for improving power efficiency while providing performance guarantees in the virtualized cluster. In the optimization model, we address application workload balancing and the often ignored switching costs due to frequent and undesirable server on/off switching and VM relocations. We show the effectiveness of the approach applied to a server cluster test bed. Our experiments show that our approach conserves about 50% of the energy required by a system designed for a peak workload scenario, with little impact on the applications' performance goals. Also, by using prediction in our optimization strategy, further QoS improvement was achieved.
Cloud
54 : PAC-PLRU: A Cache Replacement Policy to Salvage Discarded Predictions from Hardware Prefetchers
Cache replacement policy plays an important role in guaranteeing the availability of cache blocks, reducing miss rates, and improving applications' overall performance. However, recent research efforts on improving replacement policies require either significant additional hardware or major modifications to the organization of the existing cache. In this study, we propose the PAC-PLRU cache replacement policy. PAC-PLRU not only utilizes but also judiciously salvages the prediction information discarded from a widely-adopted stride prefetcher. The main idea behind PAC-PLRU is to utilize the prediction results generated by the existing stride prefetcher and to prevent these predicted cache blocks from being replaced in the near future. Experimental results show that leveraging PAC-PLRU with a stride prefetcher reduces the average L2 cache miss rate by 91% over a baseline system with only the PLRU policy, and by 22% over a system using PLRU with an unconnected stride prefetcher. Most importantly, PAC-PLRU requires only minor modifications to the existing cache architecture to obtain these benefits. The proposed PAC-PLRU policy is promising for fostering the connection between prefetching and replacement policies, and could have a lasting impact on improving overall cache performance.
Cloud
55 : Performance under Failures of MapReduce Applications
The MapReduce programming paradigm has been gaining popularity in recent years due to its support for easy programming, data distribution, and fault tolerance. Failure is an unwanted but inevitable fact that all large-scale parallel computing systems have to face. MapReduce introduces a novel data replication and task re-execution strategy for fault tolerance. This study aims to provide a better understanding of such fault tolerance mechanisms. In particular, we build a stochastic performance model to quantify the impact of failures on MapReduce applications and to investigate its effectiveness under different computing environments. Simulations have also been carried out to verify the accuracy of the proposed model. Our results show that data replication is an effective approach even when the failure rate is high, and that the task migration mechanism of MapReduce works well in balancing the reliability difference among individual nodes. This work provides a theoretical foundation for optimizing large-scale MapReduce applications, especially when fault tolerance is a concern.
Cloud
56 : Small Discrete Fourier Transforms on GPUs
Efficient implementations of the Discrete Fourier Transform (DFT) for GPUs provide good performance with large data sizes, but are not competitive with CPU code for small data sizes. On the other hand, several applications perform multiple DFTs on small data sizes. In fact, even algorithms for large data sizes use a divide-and-conquer approach, where eventually small DFTs need to be performed. We discuss our DFT implementation, which is efficient for multiple small DFTs. One feature of our implementation is the use of the asymptotically slow matrix multiplication approach for small data sizes, which improves performance on the GPU due to its regular memory access and computational patterns. We combine this algorithm with the mixed radix algorithm for 1-D, 2-D, and 3-D complex DFTs. We also demonstrate the effect of different optimization techniques. When GPUs are used to accelerate a component of an application running on the host, it is important that decisions taken to optimize the GPU performance not affect the performance of the rest of the application on the host. One feature of our implementation is that we use a data layout that is not optimal for the GPU so that the overall effect on the application is better. Our implementation performs up to two orders of magnitude faster than cuFFT on an NVIDIA GeForce 9800 GTX GPU and up to one to two orders of magnitude faster than FFTW on a CPU for multiple small DFTs. Furthermore, we show that our implementation can accelerate the performance of a Quantum Monte Carlo application for which cuFFT is not effective.
The primary contributions of this work lie in demonstrating the utility of the matrix multiplication approach and also in providing an implementation that is efficient for small DFTs when a GPU is used to accelerate an application running on the host.
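To make the matrix multiplication approach concrete, the following is a minimal CPU-side sketch in plain Python, not the authors' GPU implementation. It assumes the standard forward DFT definition with W[j][k] = exp(-2*pi*i*j*k/n); the function names `dft_matrix` and `batched_dft` are hypothetical. The point it illustrates is that a batch of equal-length small signals can share one precomputed DFT matrix, which gives every transform the same regular access pattern.

```python
import cmath

def dft_matrix(n):
    """Return the n x n DFT matrix W with W[j][k] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def batched_dft(signals):
    """Transform many equal-length signals using one shared DFT matrix.

    Each output is the matrix-vector product W @ x, which costs O(n^2)
    per signal (asymptotically slower than an FFT, but very regular).
    """
    n = len(signals[0])
    w = dft_matrix(n)  # computed once, reused for the whole batch
    return [[sum(w[j][k] * x[k] for k in range(n)) for j in range(n)]
            for x in signals]

# Two length-4 signals: a unit impulse and a constant signal.
batch = [[1, 0, 0, 0], [1, 1, 1, 1]]
spectra = batched_dft(batch)
```

In this sketch the impulse transforms to a flat spectrum and the constant signal concentrates all of its energy in the zero-frequency bin, which is a quick sanity check for the matrix construction.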
Cloud
57 : Sophia: Local Trust for Securing Routing in DHTs
Distributed Hash Tables (DHTs) have been used as a common building block in many distributed applications, including Cloud and Grid. However, there are still important security vulnerabilities that hinder their adoption in today's large-scale computing platforms. For instance, routing vulnerabilities have been a subject of intensive research, but existing solutions rely on redundancy in lieu of improving the quality of routing paths. In this paper, we present Sophia, a novel generic security technique which combines iterative routing with local trust to fortify routing in DHTs. Sophia relies strictly on first-hand observations about the success or failure of a node's own lookups to improve forwarding paths. Moreover, unlike redundant routing, Sophia dynamically protects routing without introducing additional network overhead. To the best of our knowledge, this is the first work which exploits a local trust system to fortify routing in DHTs. We compared the performance of Sophia with redundant routing in the Kademlia DHT. We obtained significant improvements regarding routing resilience, self-adjustment and network traffic reduction.
Cloud
58 : Supporting Federated Multi-authority Security Models
The JISC-funded Shintau project has produced an extension to the Shibboleth profile which allows a user to link information from more than one IdP together utilising a custom Linking Service (LS). This paper describes both the application and independent evaluation of this software by the National e-Science Centre (NeSC) at the University of Glasgow within the context of the ESRC-funded Data Management through e-Social Science (DAMES) project.
Cloud
59 : The Grid Observatory
The goal of the Grid Observatory project (GO) is to contribute to an experimental theory of large grid systems by integrating the collection of data on the behaviour of the flagship European Grid Infrastructure (EGI) and its users, the development of models, and an ontology for the domain knowledge. The GO gives access to a database of grid usage traces available to the wider computer science community without the need of grid credentials. The paper presents the architecture of the digital curation process enacted by the GO and examples of their exploitation.
Cloud
60 : A Flexible Policy Framework for the QoS Differentiated Provisioning of Services
We propose a policy-based framework for the QoS differentiated provisioning of services. The proposed framework improves the state-of-the-art in policy-based preference specification by combining cardinal and ordinal preferences. We describe the underlying models, focussing on the key features and contributions of the proposed framework. We also show how, using our framework, the QoS evaluation problem can be translated to a Constraint Satisfaction Problem while preserving the semantics of the preference policies.
Cloud
61 : Towards Real-Time, Volunteer Distributed Computing
Many large-scale distributed computing applications demand real-time responses by soft deadlines. To enable such real-time task distribution and execution on volunteer resources, we previously proposed the design of a real-time volunteer computing platform called RT-BOINC. The system gives low O(1) worst-case execution time for task management operations, such as task scheduling, state transitioning, and validation. In this work, we present a full implementation of RT-BOINC, adding new features including a deadline timer and parameter-based admission control. We evaluate RT-BOINC at large scale using two real-time applications, namely, the games Go and Chess. The results of our case study show that RT-BOINC provides much better performance than the original BOINC in terms of average and worst-case response time, scalability and efficiency.
Cloud
62 : Towards Reliable, Performant Workflows for Streaming-Applications on Cloud Platforms
Scientific workflows are commonplace in eScience applications. Yet, the lack of integrated support for data models, including streaming data, structured collections and files, is limiting the ability of workflows to support emerging applications in energy informatics that are stream oriented. This is compounded by the absence of Cloud data services that support reliable and performant streams. In this paper, we propose and present a scientific workflow framework that supports streams as first-class data, and is optimized for performant and reliable execution across desktop and Cloud platforms. The features of the workflow framework and its empirical evaluation on a private Eucalyptus cloud are presented.
Cloud
63 : Unifying Cloud Management: Towards Overall Governance of Business Level Objectives
We address the challenge of providing unified cloud resource management towards an overall business level objective, given the multitude of managerial tasks to be performed and the complexity of any architecture to support them. Resource level management tasks include elasticity control, virtual machine and data placement, autonomous fault management, etc., which are intrinsically difficult problems since services normally have unknown lifetimes and capacity demands that vary largely over time. Unifying the management of these problems for optimization with respect to some higher level business objective (such as optimizing revenue while breaking no more than a certain percentage of service level agreements) becomes even more challenging, as the resource level managerial challenges are far from independent. After providing the general problem formulation, we review recent approaches taken by the research community, including mainly general autonomic computing technology for large-scale environments and resource level management tools equipped with some business oriented or otherwise qualitative features. We propose and illustrate a policy-driven approach where a high-level management system monitors overall system and services behavior and adjusts lower level policies (e.g., thresholds for admission control, elasticity control, server consolidation level, etc.) for optimization towards the measurable business level objectives.
Cloud
64 : Enabling Public Auditability and Data Dynamics for Storage Security in Cloud Computing
Cloud Computing has been envisioned as the next-generation architecture of IT Enterprise. It moves the application software and databases to centralized large data centers, where the management of the data and services may not be fully trustworthy. This unique paradigm brings about many new security challenges, which have not been well understood. This work studies the problem of ensuring the integrity of data storage in Cloud Computing. In particular, we consider the task of allowing a third party auditor (TPA), on behalf of the cloud client, to verify the integrity of the dynamic data stored in the cloud. The introduction of the TPA eliminates the involvement of the client in auditing whether his data stored in the cloud are indeed intact, which can be important in achieving economies of scale for Cloud Computing. The support for data dynamics via the most general forms of data operation, such as block modification, insertion, and deletion, is also a significant step toward practicality, since services in Cloud Computing are not limited to archive or backup data only. While prior works on ensuring remote data integrity often lack the support of either public auditability or dynamic data operations, this paper achieves both. We first identify the difficulties and potential security problems of direct extensions with fully dynamic data updates from prior works and then show how to construct an elegant verification scheme for the seamless integration of these two salient features in our protocol design. In particular, to achieve efficient data dynamics, we improve the existing proof of storage models by manipulating the classic Merkle Hash Tree construction for block tag authentication. To support efficient handling of multiple auditing tasks, we further explore the technique of bilinear aggregate signature to extend our main result into a multiuser setting, where the TPA can perform multiple auditing tasks simultaneously. Extensive security and performance analyses show that the proposed schemes are highly efficient and provably secure.
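As background for the Merkle Hash Tree mentioned above, the following is a minimal sketch of the classic construction only, not the modified scheme or the block tag authentication proposed in the paper. It assumes SHA-256 as the hash and duplicates the last node on odd-sized levels; both choices, and the names `build_levels`, `prove`, and `verify`, are illustrative assumptions. It shows how a verifier holding only the root can check one block using a logarithmic number of sibling hashes.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """Return all tree levels, leaf hashes first and the root level last."""
    level = [h(b) for b in blocks]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]  # pad odd levels by duplicating the last node
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def prove(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf up to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the neighbour within the same pair
        path.append((level[sibling], sibling < index))
        index //= 2
    return path

def verify(block, path, root):
    """Recompute the root from one block and its sibling path."""
    node = h(block)
    for sibling, sibling_is_left in path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

blocks = [f"block-{i}".encode() for i in range(5)]
levels = build_levels(blocks)
root = levels[-1][0]
proof = prove(levels, 4)
```

With these definitions, `verify(blocks[4], proof, root)` accepts the untouched block, while any modified block fails the check. A production design would also need to handle the ambiguity that last-node duplication introduces, which this sketch ignores.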
Cloud
65 : Multicloud Deployment of Computing Clusters for Loosely Coupled MTC Applications
Cloud computing is gaining acceptance in many IT organizations as an elastic, flexible, and variable-cost way to deploy their service platforms using outsourced resources. Unlike traditional utilities, where a single provider scheme is common practice, the ubiquitous access to cloud resources easily enables the simultaneous use of different clouds. In this paper, we explore this scenario to deploy a computing cluster on top of a multicloud infrastructure for solving loosely coupled Many-Task Computing (MTC) applications. In this way, the cluster nodes can be provisioned with resources from different clouds to improve the cost effectiveness of the deployment, or to implement high-availability strategies. We prove the viability of this kind of solution by evaluating the scalability, performance, and cost of different configurations of a Sun Grid Engine cluster deployed on a multicloud infrastructure spanning a local data center and three different cloud sites: Amazon EC2 Europe, Amazon EC2 US, and ElasticHosts. Although the testbed deployed in this work is limited to a reduced number of computing resources (due to hardware and budget limitations), we have complemented our analysis with a simulated infrastructure model, which includes a larger number of resources and runs larger problem sizes. Data obtained by simulation show that performance and cost results can be extrapolated to large-scale problems and cluster infrastructures.
Cloud
66 : Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud
In recent years, ad hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major Cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be inadequate for big parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines which are automatically instantiated and terminated during the job execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.
Cloud
67 : Robust Execution of Service Workflows Using Redundancy and Advance Reservations
In this paper, we develop a novel algorithm that allows service consumers to execute business processes (or workflows) of interdependent services in a dependable manner within tight time-constraints. In particular, we consider large interorganizational service-oriented systems, where services are offered by external organizations that demand financial remuneration and where their use has to be negotiated in advance using explicit service-level agreements (as is common in Grids and cloud computing). Here, different providers often offer the same type of service at varying levels of quality and price. Furthermore, some providers may be less trustworthy than others, possibly failing to meet their agreements. To control this unreliability and ensure end-to-end dependability while maximizing the profit obtained from completing a business process, our algorithm automatically selects the most suitable providers. Moreover, unlike existing work, it reasons about the dependability properties of a workflow, and it controls these by using service redundancy for critical tasks and by planning for contingencies. Finally, our algorithm reserves services for only parts of its workflow at any time, in order to retain flexibility when failures occur. We show empirically that our algorithm consistently outperforms existing approaches, achieving up to a 35-fold increase in profit and successfully completing most workflows, even when the majority of providers fail.
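The benefit of service redundancy for a critical task can be illustrated with a small calculation. The sketch below is not the paper's algorithm; it assumes that providers fail independently and that the task succeeds if at least one redundant provider delivers. The function name `success_probability` and the failure probabilities are hypothetical.

```python
def success_probability(failure_probs):
    """Probability that at least one redundant provider succeeds.

    Assumes independent failures: the task fails only if every
    provider fails, so P(success) = 1 - product of failure probabilities.
    """
    all_fail = 1.0
    for p in failure_probs:
        all_fail *= p
    return 1.0 - all_fail

# Hypothetical providers that each fail 60% of the time.
single = success_probability([0.6])
triple = success_probability([0.6, 0.6, 0.6])
```

Under these assumptions, a single provider completes the task with probability 0.4, while three redundant providers raise that to about 0.784, which suggests how a workflow might still complete when most individual providers are unreliable. Redundancy also multiplies the price paid, which is why a profit-maximizing planner would apply it selectively.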