Devashree Tripathy
Assistant Professor at IIT Bhubaneswar
Associate at Harvard University


Harvard John A. Paulson School of Engineering and Applied Sciences
Harvard University
Boston, MA 02134

Email: devashreetripathy [at] iitbbs [dot] ac [dot] in

Google Scholar Publons Scopus dblp LinkedIn CV

I am actively recruiting PhD students at Indian Institute of Technology (IIT) Bhubaneswar!


I am looking for motivated and curious students eager to pursue ground-breaking research by designing power-efficient, scalable, and high-performance computing systems. My research bridges systems, computer architecture, and machine learning to design high-performance and efficient computing platforms!
As a member of my lab, you will be part of a highly collaborative and social research community. If you are interested in joining my research group, I strongly encourage you to apply to the IIT Bhubaneswar PhD program; see the advertisement for PhD admission in CSE (Computer Science and Engineering, School of Electrical Sciences), IIT Bhubaneswar. Please also email me (devashreetripathy@iitbbs.ac.in) with a copy of your CV, your transcript, and a brief summary of the research areas that excite you!



Biosketch

Dr. Devashree Tripathy is an Assistant Professor of Computer Science and Engineering at the Indian Institute of Technology, Bhubaneswar. Her research spans Computer Architecture, Machine Learning, Hardware/Software Co-design of AI Systems, Energy-Efficient Computing, Sustainable AI, TinyML, Systems for Healthcare, Drones, Autonomous Cars and Robotics, and Mobile and Edge Computing. Her current research focuses on ML for Systems and Systems for ML. Prior to joining IIT Bhubaneswar, Dr. Tripathy was a Postdoctoral Fellow in Computer Science in the Harvard Architecture, Circuits, and Compilers Group at Harvard University, where she currently holds an appointment as an Associate. She received her PhD in Computer Science from the University of California, Riverside.



News

May, 2024
Invited to join Microsoft Research India Academic Summit 2024.
March, 2024
Invited to serve on the Program Committee (PC) for MICRO 2024. Consider submitting your best work.
Feb, 2024
Delivered an invited talk at the National Supercomputing Mission (NSM)-Department of Science and Technology (DST) Sponsored Workshop on High Performance Computing (HPC) in Engineering Applications, Department of CSE, NIT Warangal.
Jan, 2024
Delivered an invited talk at a workshop on Present and Future Computing Systems at the Department of Computer Science and Automation (CSA), Indian Institute of Science (IISc), Bangalore. The workshop focused on encouraging bachelor's students to pursue research careers in computer systems.
July, 2023
Delivered invited guest talks at the RBI 8th, 9th, and 10th workshops on "See yourself in Cyber", conducted for senior officials of the Reserve Bank of India (RBI).
June, 2023
Delivered an invited guest talk (Topic 1: HPC for Finance Applications; Topic 2: IoT and Its Applications) at the RBI 7th workshop on "See yourself in Cyber", conducted for senior officials of the Reserve Bank of India (RBI), India's central bank and the regulatory body responsible for the Indian banking system.
March, 2023
Our paper on evaluating machine learning algorithms for architecture design will appear at the International Symposium on Computer Architecture (ISCA 2023). Congratulations, Sri!
Jan, 2023
Delivered an invited guest talk (Title: Designing Efficient Heterogeneous Systems Using Machine Learning) in the "Eminent Persons Talk" series at GEC, Bhubaneswar.
November, 2022
Joined the Indian Institute of Technology (IIT) Bhubaneswar as an Assistant Professor.
Oct, 2022
Chaired a session (Session Name: Computing - II) at the International Conference on Computing, Communication and Learning (CoCoLe), NIT Warangal, Telangana.
May, 2022
Our paper on a Design Space Exploration Framework for Domain-Specific SoCs will appear in ACM Transactions on Embedded Computing Systems (TECS). Congratulations, Behzad!
February, 2022
Serving as Area Chair (Computer Architecture) for the Journal of Systems Research (JSys).
January, 2022
Serving on the Program Committee of the European Conference on Computer Systems (EuroSys '22 Artifacts).
August 25, 2021
Successfully defended my PhD dissertation, titled "Improving Performance and Energy Efficiency of GPUs through Locality Analysis". Officially a Doctor now :-). I will be joining Harvard University as a Postdoctoral Fellow in Computer Architecture/Systems and VLSI this fall.
July 15, 2021
Two of our papers accepted to appear in NAS 2021.
February 22, 2021
Passed the PhD Dissertation Proposal Defense Exam.
February 16, 2021
Our paper on GPU Data-Locality and Thread-Block Scheduling accepted to TACO 2021.
June 15, 2020
Selected for GHC 2020 student scholarship.
May 22, 2020
Two of our papers accepted to ISLPED 2020.
June 2, 2019
Won student travel grant to attend ISCA 2019 and HPDC 2019.
May 2, 2019
Our paper on GPU Undervolting and Reliability accepted to ICS 2019.
Oct 5, 2018
Our book on BCI System Design is online now!
May 10, 2018
Our book on BCI System Design approved to be published under Series SpringerBriefs in Computational Intelligence.
Apr 14, 2018
Won student travel grant to attend ISCA 2018.
Feb 16, 2018
I will be joining the Samsung Austin R&D Center in Austin, TX as a GPU Modelling Intern for Summer '18.
Feb 5, 2018
Won student travel grant to attend Grad Cohort for Women 2018.
Sept 14, 2017
Won student travel grant to attend MICRO 2017.
Sept 9, 2017
Won student travel grant to attend Third Career Workshop for Women and Minorities in Computer Architecture.
July 2, 2017
Our paper on Data Dependency Support in GPU accepted to MICRO 2017.
August 8, 2016
Won student travel grant to attend NAS 2016.
March 15, 2016
Passed Oral Qualifying Exam. PhD Candidate now!


Education and Experience

Postdoctoral Fellow in Computer Science

HARVARD UNIVERSITY | Cambridge, MA, USA | 2021 - current
Research Topic: Design and Modeling of Domain Specific SoCs.
Advisor: Dr. David Brooks

PhD in Computer Science

UNIVERSITY OF CALIFORNIA, RIVERSIDE (UCR) | Riverside, CA, USA | 2015-2021
(CGPA: 3.9/4.0)
Thesis: Improving Performance and Energy Efficiency of GPUs through Locality Analysis.
Advisor: Dr. Laxmi N. Bhuyan

• Dean’s Distinguished Fellowship

GPU Modelling Intern

Samsung Austin R&D Center, USA | 06/2018 - 09/2018

• Award of Excellence

Master of Technology, Advanced Electronics Systems

CSIR – CENTRAL ELECTRONICS ENGINEERING RESEARCH INSTITUTE (CSIR-CEERI) | Pilani, India | 2012-2014
(CGPA: 8.53/10)
Thesis: Patient Assistance System Using Brain Computer Interface.
Advisor: Dr. Jagdish Lal Raheja

• Quick-Hire Fellowship by Government of India, CSIR.

Bachelor of Technology, Electronics and Telecommunications Engineering

VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY (VSSUT, FORMERLY UCE) | Burla, India | 2008-2012
(CGPA: 9.69/10)

• Governor’s Gold Medal for being ranked first among all undergraduate students in the university. | University Topper




Teaching

IITBBS-UG-CS3L002
Computer Organization and Architecture (Autumn 2023)

IITBBS-PG-CS6L009
High Performance Computer Architecture (Spring 2023)

IITBBS-UG-CS1P001
Introduction to Programming and Data Structures (Spring/Autumn 2023-24)

Harvard-PG-CS249r
Tiny Machine Learning: Applied Machine Learning on Embedded IoT Devices (Fall 2022)

UCR-UG-CS005
Introduction to Computer Programming (Fall 2019)

UCR-PG-CS203
Advanced Computer Architecture (Winter 2018, Winter 2019)

UCR-PG-CS213
Multiprocessor Architecture and Programming (Spring 2018, Winter 2020)

CEERI-PG-2-219
Advanced Signal and Image Processing (2014)


Selected Publications

Google Scholar; DBLP

Conference

A2

Guac: Energy-Aware and SSA-Based Generation of Coarse-Grained Merged Accelerators from LLVM-IR. arXiv '24

Iulian Brumar, Rodrigo Rocha, Alex Bernat, Devashree Tripathy, David Brooks, Gu-Yeon Wei, CoRR abs/2402.13513 (2024)
(with Harvard University and University of Edinburgh, UK)

Designing accelerators for resource- and power-constrained applications is a daunting task. High-level Synthesis (HLS) addresses these constraints through resource sharing, an optimization at the HLS binding stage that maps multiple operations to the same functional unit. However, resource sharing is often limited to reusing instructions within a basic block. Instead of searching globally for the best control and dataflow graphs (CDFGs) to combine, it is constrained by existing instruction mappings and schedules. Coarse-grained function merging (CGFM) at the intermediate representation (IR) level can reuse control and dataflow patterns without dealing with the post-scheduling complexity of mapping operations onto functional units, wires, and registers. The merged functions produced by CGFM can be translated to RTL by HLS, yielding Coarse Grained Merged Accelerators (CGMAs). CGMAs are especially profitable across applications with similar data- and control-flow patterns. Prior work has used CGFM to generate CGMAs without regard for which CGFM algorithms best optimize area, power, and energy costs. We propose Guac, an energy-aware and SSA-based (static single assignment) CGMA generation methodology. Guac implements a novel ensemble of cost models for efficient CGMA generation. We also show that CGFM algorithms using SSA form to merge control- and dataflow graphs outperform prior non-SSA CGFM designs. We demonstrate significant area, power, and energy savings with respect to the state of the art. In particular, Guac more than doubles energy savings with respect to the closest related work while using a strong resource-sharing baseline.
@article{brumar2024guac, title={Guac: Energy-Aware and SSA-Based Generation of Coarse-Grained Merged Accelerators from LLVM-IR}, author={Brumar, Iulian and Rocha, Rodrigo and Bernat, Alex and Tripathy, Devashree and Brooks, David and Wei, Gu-Yeon}, journal={arXiv preprint arXiv:2402.13513}, year={2024} }
A1

PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices arXiv '23

Yuji Chai, Devashree Tripathy, Chuteng Zhou, Dibakar Gope, Igor Fedorov, Ramon Matas, David Brooks, Gu-Yeon Wei, Paul Whatmough, arXiv preprint arXiv:2301.10999 (2023)
(with Harvard University and ARM Research, USA)

@article{chai2023perfsage, title={PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices}, author={Chai, Yuji and Tripathy, Devashree and Zhou, Chuteng and Gope, Dibakar and Fedorov, Igor and Matas, Ramon and Brooks, David and Wei, Gu-Yeon and Whatmough, Paul}, journal={arXiv preprint arXiv:2301.10999}, year={2023} }
C7

ArchGym: Establishing Stronger Baselines for Machine-Learning Assisted Architecture Design. ISCA '23

Srivatsan Krishnan, Amir Yazdanbakhsh, Jason Jabbour, Ikechukwu Uchendu, Susobhan Ghosh, Behzad Boroujerdian, Daniel Richins, Devashree Tripathy, Aleksandra Faust, and Vijay Janapa Reddi (with Harvard University, University of Texas at Austin, Google Research/Brain Team, and Facebook Research)
International Symposium on Computer Architecture (ISCA), 2023, June 17-23, Orlando, Florida, USA. (acceptance rate: 21%, Core Ranking: A*)

Machine learning (ML) has become a prevalent approach to tame the complexity of design space exploration for domain-specific architectures. While appealing, using ML for design space exploration poses several challenges. First, it is not straightforward to identify the most suitable algorithm from an ever-increasing pool of ML methods. Second, assessing the trade-offs between performance and sample efficiency across these methods is inconclusive. Finally, the lack of a holistic framework for fair, reproducible, and objective comparison across these methods hinders the progress of adopting ML-aided architecture design space exploration and impedes creating repeatable artifacts. To mitigate these challenges, we introduce ArchGym, an open-source gymnasium and easy-to-extend framework that connects a diverse range of search algorithms to architecture simulators. To demonstrate its utility, we evaluate ArchGym across multiple vanilla and domain-specific search algorithms in the design of a custom memory controller, deep neural network accelerators, and a custom SoC for AR/VR workloads, collectively encompassing over 21K experiments. The results suggest that with an unlimited number of samples, ML algorithms are equally favorable to meet the user-defined target specification if its hyperparameters are tuned thoroughly; no one solution is necessarily better than another (e.g., reinforcement learning vs. Bayesian methods). We coin the term "hyperparameter lottery" to describe the relatively probable chance for a search algorithm to find an optimal design provided meticulously selected hyperparameters. Additionally, the ease of data collection and aggregation in ArchGym facilitates research in ML-aided architecture design space exploration. As a case study, we show this advantage by developing a proxy cost model with an RMSE of 0.61% that offers a 2,000-fold reduction in simulation time. Code and data for ArchGym is available at https://bit.ly/ArchGym.
@inproceedings{krishnan2023archgym, title={ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design}, author={Krishnan, Srivatsan and Yazdanbakhsh, Amir and Prakash, Shvetank and Jabbour, Jason and Uchendu, Ikechukwu and Ghosh, Susobhan and Boroujerdian, Behzad and Richins, Daniel and Tripathy, Devashree and Faust, Aleksandra and others}, booktitle={Proceedings of the 50th Annual International Symposium on Computer Architecture}, pages={1--16}, year={2023} }
C6

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs NAS '21

Devashree Tripathy, Amirali Abdolrashidi, Quan Fan, Daniel Wong and Manoranjan Satpathy
15th IEEE International Conference on Networking, Architecture, and Storage (NAS 2021), October 24-26, Riverside, California, USA.

Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and handling the memory bottleneck problem. This paper proposes a thread block-centric locality analysis, which identifies the locality among the thread blocks (TBs) in terms of a number of common data references. In LocalityGuru, we seek to employ a detailed just-in-time (JIT) compilation analysis of the static memory accesses in the source code and derive the mapping between the threads and data indices at kernel-launch-time. Our locality analysis technique can be employed at multiple granularities such as threads, warps, and thread blocks in a GPU Kernel. This information can be leveraged to help make smarter decisions for locality-aware data-partition, memory page data placement, cache management, and scheduling in single-GPU and multi-GPU systems. The results of the LocalityGuru PTX analyzer are then validated by comparing with the Locality graph obtained through profiling. Since the entire analysis is carried out by the compiler before the kernel launch time, it does not introduce any timing overhead to the kernel execution time.
@INPROCEEDINGS{9605411, author={Tripathy, Devashree and Abdolrashidi, AmirAli and Fan, Quan and Wong, Daniel and Satpathy, Manoranjan}, booktitle={2021 IEEE International Conference on Networking, Architecture and Storage (NAS)}, title={LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs}, year={2021}, volume={}, number={}, pages={1-8}, doi={10.1109/NAS51552.2021.9605411}}
C5

ICAP: Designing Inrush Current Aware Power Gating Switch for GPGPU NAS '21

Hadi Zamani, Devashree Tripathy, Ali Jahanshahi, and Daniel Wong
15th IEEE International Conference on Networking, Architecture, and Storage (NAS 2021), October 24-26, Riverside, California, USA.

The leakage energy of GPGPU can be reduced by power gating the idle logic or undervolting the storage structures; however, the performance and reliability of the system degrade due to large wake-up time and inrush current at the time of activation. In this paper, we thoroughly analyze the realistic Break-Even Time (BET) and inrush current for various components in GPGPU architecture considering the recent design of multi-modal Power Gating Switch (PGS). Then, we introduce a new PGS which covers the current PGS drawbacks. Our redesigned PGS is carefully tailored to minimize the inrush current and BET. GPGPU-Sim simulation results for various applications show that by incorporating the proposed PGS into GPGPU-Sim, we can save leakage energy up to 82%, 38%, and 60% for register files, integer units, and floating units respectively.
@INPROCEEDINGS{9605434, author={Zamani, Hadi and Tripathy, Devashree and Jahanshahi, Ali and Wong, Daniel}, booktitle={2021 IEEE International Conference on Networking, Architecture and Storage (NAS)}, title={ICAP: Designing Inrush Current Aware Power Gating Switch for GPGPU}, year={2021}, volume={}, number={}, pages={1-8}, doi={10.1109/NAS51552.2021.9605434}}
C4

Slumber: Static-Power Management for GPGPU Register Files ISLPED '20

Devashree Tripathy, Hadi Zamani, Debiprasanna Sahoo, Laxmi Narayan Bhuyan, and Manoranjan Satpathy
ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2020, August 10-12, Boston, Massachusetts, USA (virtual due to COVID-19).

The leakage power dissipation has become one of the major concerns with technology scaling. The GPGPU register file has grown in size over the last decade in order to support the parallel execution of thousands of threads. Given that each thread has its own dedicated set of physical registers, these registers remain idle when corresponding threads go for long latency operation. Existing research shows that the leakage energy consumption of the register file can be reduced by undervolting the idle registers to a data-retentive low-leakage voltage (Drowsy Voltage) to ensure that the data is not lost while not in use. In this paper, we develop a realistic model for determining the wake-up time of registers from various undervolting and power gating modes. Next, we propose a hybrid energy saving technique where a combination of power gating and undervolting can be used to save optimum energy depending on the idle period of the registers with a negligible performance penalty. Our simulation shows that the hybrid energy-saving technique results in 94% leakage energy savings in register files on an average when compared with the conventional clock gating technique and 9% higher leakage energy saving compared to the state-of-art technique.
@inproceedings{tripathy2020slumber, title={Slumber: static-power management for gpgpu register files}, author={Tripathy, Devashree and Zamani, Hadi and Sahoo, Debiprasanna and Bhuyan, Laxmi N and Satpathy, Manoranjan}, booktitle={Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design}, pages={109--114}, year={2020} }
C3

SAOU: Safe Adaptive Overclocking and Undervolting for Energy-Efficient GPU Computing ISLPED '20

Hadi Zamani, Devashree Tripathy, Laxmi Narayan Bhuyan and Zizhong Chen
ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2020, August 10-12, Boston, Massachusetts, USA (virtual due to COVID-19).

The current trend of ever-increasing performance in scientific applications comes with tremendous growth in energy consumption. In this paper, we present a framework for GPU applications, which reduces energy consumption in GPUs through Safe Overclocking and Undervolting (SAOU) without sacrificing performance. The idea is to increase the frequency beyond the safe frequency f_safeMax and undervolt below V_safeMin to get maximum energy saving. Since such overclocking and undervolting may give rise to faults, we employ an enhanced checkpoint-recovery technique to cover the possible errors. Empirically, we explore different errors and derive a fault model that can set the undervolting and overclocking level for maximum energy saving. We target the cuBLAS Matrix Multiplication (cuBLAS-MM) kernel for error correction using the checkpoint and recovery (CR) technique as an example of scientific applications. In the case of cuBLAS, SAOU achieves up to 22% energy reduction through undervolting and overclocking without sacrificing the performance.
@inproceedings{zamani2020saou, title={SAOU: safe adaptive overclocking and undervolting for energy-efficient GPU computing}, author={Zamani, Hadi and Tripathy, Devashree and Bhuyan, Laxmi and Chen, Zizhong}, booktitle={Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design}, pages={205--210}, year={2020} }
C2

GreenMM: Energy-Efficient GPU Matrix Multiplication Through Undervolting ICS '19

Hadi Zamani, Yuanlai Liu, Devashree Tripathy, Laxmi Narayan Bhuyan, and Zizhong Chen
International Conference on Supercomputing (ICS), 2019, June 26-28, Phoenix, Arizona, USA. (acceptance rate: 23.3%, Core ranking: A)

The current trend of ever-increasing performance in scientific applications comes with tremendous growth in energy consumption. In this paper, we present the GreenMM framework for matrix multiplication, which reduces energy consumption in GPUs through undervolting without sacrificing the performance. The idea in this paper is to undervolt the GPU beyond the minimum operating voltage (Vmin) to save maximum energy while keeping the frequency constant. Since such undervolting may give rise to faults, we design an Algorithm Based Fault Tolerance (ABFT) algorithm to detect and correct those errors. We target cuBLAS Matrix Multiplication (cuBLAS-MM), as a key kernel used in many scientific applications. Empirically, we explore different errors and derive a fault model as a function of undervolting levels and matrix sizes. Then, using the model, we configure the proposed FT-cuBLAS-MM algorithm. We show that energy consumption is reduced by up to 19.8%. GreenMM also improves the GFLOPS/Watt by 9% with negligible performance overhead.
@inproceedings{DBLP:conf/ics/ZamaniLTBC19, author = {Hadi Zamani and Yuanlai Liu and Devashree Tripathy and Laxmi N. Bhuyan and Zizhong Chen}, title = {GreenMM: energy efficient {GPU} matrix multiplication through undervolting}, booktitle = {Proceedings of the {ACM} International Conference on Supercomputing, {ICS} 2019, Phoenix, AZ, USA, June 26-28, 2019}, pages = {308--318}, year = {2019}, crossref = {DBLP:conf/ics/2019}, url = {https://doi.org/10.1145/3330345.3330373}, doi = {10.1145/3330345.3330373}, timestamp = {Wed, 19 Jun 2019 08:40:19 +0200}, biburl = {https://dblp.org/rec/bib/conf/ics/ZamaniLTBC19}, bibsource = {dblp computer science bibliography, https://dblp.org} }
C1

WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs. MICRO '17

AmirAli Abdolrashidi, Devashree Tripathy, Mehmet Esat Belviranli, Laxmi Narayan Bhuyan, and Daniel Wong
The 50th International Symposium on Microarchitecture (MICRO), 2017, October 14-18, Boston, Massachusetts, USA. (acceptance rate: 18.6%, Core Ranking: A*)

GPUs lack fundamental support for data-dependent parallelism and synchronization. While CUDA Dynamic Parallelism signals progress in this direction, many limitations and challenges still remain. This paper introduces WIREFRAME, a hardware-software solution that enables generalized support for data-dependent parallelism and synchronization. Wireframe enables applications to naturally express execution dependencies across different thread blocks through a dependency graph abstraction at run-time, which is sent to the GPU hardware at kernel launch. At run-time, the hardware enforces the dependencies specified in the dependency graph through a dependency-aware thread block scheduler. Overall, Wireframe is able to improve total execution time up to 65.20% with an average of 45.07%.
@inproceedings{abdolrashidi2017wireframe, title={Wireframe: supporting data-dependent parallelism through dependency graph execution in GPUs}, author={Abdolrashidi, Amir Ali and Tripathy, Devashree and Belviranli, Mehmet Esat and Bhuyan, Laxmi Narayan and Wong, Daniel}, booktitle={Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture}, pages={600--611}, year={2017}, organization={ACM} }

Journal

J4

FARSI: Early-stage Design Space Exploration Framework to Tame the Domain-specific System-on-chip Complexity ACM TRANSACTIONS TECS'22

Behzad Boroujerdian, Ying Jing, Devashree Tripathy, Amit Kumar, Lavanya Subramanian, Luke Yen, Vincent Lee, Vivek Venkatesan, Amit Jindal, Robert Shearer, Vijay Janapa Reddi (with Harvard University, University of Texas at Austin, University of Illinois Urbana-Champaign and Facebook Research)
ACM Transactions on Embedded Computing Systems. (Impact Factor: 1.53 (2022), SCImago Journal Rank (SJR): 0.476, Research Impact Score: 4.95)

Domain-specific SoCs (DSSoCs) are an attractive solution for domains with extremely stringent power, performance, and area constraints. However, DSSoCs suffer from two fundamental complexities. On the one hand, their many specialized hardware blocks result in complex systems and thus high development effort. On the other hand, their many system knobs expand the complexity of design space, making the search for the optimal design difficult. Thus, to reach prevalence, taming such complexities is necessary. To address these challenges, in this work, we identify the necessary features of an early-stage design space exploration (DSE) framework that targets the complex design space of DSSoCs and provide an instance of one such framework that we refer to as FARSI. FARSI provides an agile system-level simulator with speedup and accuracy of 8,400x and 98.5% compared to Synopsys Platform Architect. FARSI also provides an efficient exploration heuristic and achieves up to 62x and 35x improvement in convergence time compared to the classic simulated annealing (SA) and modern Multi-Objective Optimistic Search (MOOS). This is done by augmenting SA with architectural reasoning such as locality exploitation and bottleneck relaxation. Furthermore, we embed various co-design capabilities and show that, on average, they have a 32% impact on the convergence rate. Finally, we demonstrate that using development-cost-aware policies can lower the system complexity, both in terms of the component count and variation by as much as 60% and 82% (e.g., for the Network-on-a-Chip subsystem), respectively. PS: This paper targets the Special Issue on Domain-Specific System-on-Chip Architectures and Run-Time Management Techniques.

@article{boroujerdian2023farsi, title={FARSI: An early-stage design space exploration framework to tame the domain-specific system-on-chip complexity}, author={Boroujerdian, Behzad and Jing, Ying and Tripathy, Devashree and Kumar, Amit and Subramanian, Lavanya and Yen, Luke and Lee, Vincent and Venkatesan, Vivek and Jindal, Amit and Shearer, Robert and others}, journal={ACM Transactions on Embedded Computing Systems}, volume={22}, number={2}, pages={1--35}, year={2023}, publisher={ACM New York, NY} }

J3

PAVER: Locality Graph-based Thread Block Scheduling for GPUs ACM TRANSACTIONS TACO'21

Devashree Tripathy, Amirali Abdolrashidi, Laxmi Bhuyan, Liang Zhou and Daniel Wong
ACM Transactions on Architecture and Code Optimization. (Impact Factor: 1.309 (2019), SCImago Journal Rank (SJR): 0.263)
* Invited to present at the European Network on High Performance and Embedded Architecture and Compilation (HiPEAC 2022), June 20-22, Budapest, Hungary.

The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored. We present PAVER, a priority-aware vertex scheduler, which takes a graph-theoretic approach towards thread scheduling. We analyze the cache locality behavior among thread blocks (TBs) through a just-in-time (JIT) compilation, and represent the problem using a graph representing the TBs and the locality among them. This graph will then be partitioned to TB groups that display maximum data sharing, which are then assigned to the same SM by the locality-aware TB scheduler. Through exhaustive simulation in Fermi, Pascal and Volta architectures using a number of scheduling techniques, we show that our graph theoretic-guided TB scheduler reduces L2 accesses by 43.3%, 48.5%, 40.21% and increases the average performance benefit by 30%, 50.4%, 40.2% for the benchmarks with high inter-TB locality.

@article{10.1145/3451164, author = {Tripathy, Devashree and Abdolrashidi, Amirali and Bhuyan, Laxmi Narayan and Zhou, Liang and Wong, Daniel}, title = {PAVER: Locality Graph-Based Thread Block Scheduling for GPUs}, year = {2021}, issue_date = {September 2021}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {18}, number = {3}, issn = {1544-3566}, url = {https://doi.org/10.1145/3451164}, doi = {10.1145/3451164}, abstract = {The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored.We present PAVER, a Priority-Aware Vertex schedulER, which takes a graph-theoretic approach toward thread scheduling. We analyze the cache locality behavior among thread blocks (TBs) through a just-in-time compilation, and represent the problem using a graph representing the TBs and the locality among them. This graph is then partitioned to TB groups that display maximum data sharing, which are then assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation in Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for the benchmarks with high inter-TB locality.}, journal = {ACM Trans. Archit. Code Optim.}, month = {jun}, articleno = {32}, numpages = {26}, keywords = {thread block, dependency graph, locality, GPGPU} }

J2

An improved load-balancing mechanism based on deadline failure recovery on GridSim EC, Springer'16

Deepak Kumar Patel, Devashree Tripathy, and C. R. Tripathy
Engineering with Computers, April 2016, Springer, Volume 32, Issue 2, pp. 173–188 (Impact Factor: 7.963 (2020)).

Grid computing has emerged as a new field, distinguished from conventional distributed computing. It focuses on large-scale resource sharing, innovative applications and, in some cases, high performance orientation. The Grid serves as a comprehensive and complete system for organizations by which the maximum utilization of resources is achieved. Load balancing is a process which involves resource management and an effective load distribution among the resources. Therefore, it is considered to be very important in Grid systems. For a Grid, a dynamic, distributed load balancing scheme provides deadline control for tasks. Due to the condition of deadline failure, developing, deploying, and executing long-running applications over the grid remains a challenge. So, deadline failure recovery is an essential factor for Grid computing. In this paper, we propose a dynamic distributed load-balancing technique called “Enhanced GridSim with Load balancing based on Deadline Failure Recovery” (EGDFR) for computational Grids with heterogeneous resources. The proposed algorithm EGDFR is an improved version of the existing EGDC in which we perform load balancing by providing a scheduling system which includes the mechanism of recovery from deadline failure of the Gridlets. Extensive simulation experiments are conducted to quantify the performance of the proposed load-balancing strategy on the GridSim platform. Experiments have shown that the proposed system can considerably improve Grid performance in terms of total execution time, percentage gain in execution time, average response time, resubmitted time and throughput. The proposed load-balancing technique gives 7% better performance than EGDC in the case of a constant number of resources, whereas in the case of a constant number of Gridlets, it gives 11% better performance than EGDC.

@Article{Patel2016, author="Patel, Deepak Kumar and Tripathy, Devashree and Tripathy, Chitaranjan", title="An improved load-balancing mechanism based on deadline failure recovery on GridSim", journal="Engineering with Computers", year="2016", month="Apr", day="01", volume="32", number="2", pages="173--188", abstract="Grid computing has emerged a new field, distinguished from conventional distributed computing. It focuses on large-scale resource sharing, innovative applications and in some cases, high performance orientation. The Grid serves as a comprehensive and complete system for organizations by which the maximum utilization of resources is achieved. The load balancing is a process which involves the resource management and an effective load distribution among the resources. Therefore, it is considered to be very important in Grid systems. For a Grid, a dynamic, distributed load balancing scheme provides deadline control for tasks. Due to the condition of deadline failure, developing, deploying, and executing long running applications over the grid remains a challenge. So, deadline failure recovery is an essential factor for Grid computing. In this paper, we propose a dynamic distributed load-balancing technique called ``Enhanced GridSim with Load balancing based on Deadline Failure Recovery'' (EGDFR) for computational Grids with heterogeneous resources. The proposed algorithm EGDFR is an improved version of the existing EGDC in which we perform load balancing by providing a scheduling system which includes the mechanism of recovery from deadline failure of the Gridlets. Extensive simulation experiments are conducted to quantify the performance of the proposed load-balancing strategy on the GridSim platform. Experiments have shown that the proposed system can considerably improve Grid performance in terms of total execution time, percentage gain in execution time, average response time, resubmitted time and throughput. 
The proposed load-balancing technique gives 7 {\%} better performance than EGDC in case of constant number of resources, whereas in case of constant number of Gridlets, it gives 11 {\%} better performance than EGDC.", issn="1435-5663", doi="10.1007/s00366-015-0409-y", url="https://doi.org/10.1007/s00366-015-0409-y" }

J1

Survey of load balancing techniques for Grid JNCA, Elsevier'16

Deepak Kumar Patel, Devashree Tripathy, and C. R. Tripathy
Journal of Network and Computer Applications, ELSEVIER Volume 65, April 2016, Pages 103-119 (Impact Factor: 6.281 (2020)).

In recent years, owing to rapid technological advancements, Grid computing has become an important area of research. Grid computing has emerged as a new field, distinguished from conventional distributed computing, that focuses on large-scale resource sharing, innovative applications and, in some cases, high-performance orientation. A Grid is a network of computational resources that may potentially span many continents. The Grid serves as a comprehensive and complete system through which organizations achieve maximum utilization of resources. Load balancing is the process of managing resources and distributing load effectively among them, and is therefore considered very important in Grid systems. This work presents an extensive survey of the load-balancing techniques proposed so far. These techniques are applicable to various systems depending on the needs of the computational Grid and on the type of environment, resources, virtual organizations and job profile each is supposed to work with. Each of these models has its own merits and demerits, which form the subject matter of this survey. A detailed classification of the various load-balancing techniques based on different parameters is also included.

@article{PATEL2016103, title = "Survey of load balancing techniques for Grid", journal = "Journal of Network and Computer Applications", volume = "65", number = "", pages = "103 - 119", year = "2016", note = "", issn = "1084-8045", doi = "http://dx.doi.org/10.1016/j.jnca.2016.02.012", url = "http://www.sciencedirect.com/science/article/pii/S1084804516000953", author = "Deepak Kumar Patel and Devashree Tripathy and C.R. Tripathy", keywords = "Grid computing", keywords = "Distributed systems", keywords = "Load balancing" }

Book

B1

Real-Time BCI System Design to Control Arduino Based Speed Controllable Robot Using EEG Springer '18

Swagata Das, Devashree Tripathy, Jagdish Lal Raheja
SpringerBriefs in Computational Intelligence, 2018.

@book{das2018real, title={Real-Time BCI System Design to Control Arduino Based Speed Controllable Robot Using EEG}, author={Das, Swagata and Tripathy, Devashree and Raheja, Jagdish Lal}, year={2018}, publisher={Springer} }

Book Chapter

BC-1

Design and Implementation of Brain Computer Interface Based Robot Motion Control Springer'14

Devashree Tripathy, and Jagdish Lal Raheja
Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA), 2014.

In this paper, a Brain Computer Interface (BCI) robot motion control system for patient assistance is designed and implemented. The proposed system acquires data from the patient’s brain through a group of sensors using the Emotiv Epoc neuroheadset. The acquired signal is processed, and from the processed data the BCI system determines the patient’s requirements and issues commands (output signals) accordingly. The processed data is translated into action by the robot as per the patient’s requirement. A graphical user interface (GUI) was developed to control the motion of the robot. The proposed system is designed to help persons with severe disabilities, such as those suffering from spinal cord injuries or paralytic attacks, and is also helpful to anyone who cannot move physically and finds it difficult to express their needs verbally.
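The acquire–process–translate pipeline described above can be sketched in a few lines. All names, features, and thresholds here are illustrative assumptions, not the chapter's code: a real system would read windows of EEG samples from the headset SDK and drive actual motors, whereas this toy version classifies a window by its mean amplitude.

```python
# Minimal sketch of a BCI command pipeline: acquire -> process -> act.
# The feature (mean amplitude) and thresholds are purely illustrative;
# a real system would extract features from EEG read via the headset SDK.

def process(samples):
    """Toy signal processing: reduce an EEG window to one feature
    (its mean amplitude)."""
    return sum(samples) / len(samples)


def classify(feature):
    """Map the processed feature to a robot command via fixed
    (hypothetical) thresholds."""
    if feature > 0.5:
        return "FORWARD"
    if feature < -0.5:
        return "BACKWARD"
    return "STOP"


def bci_step(samples, send_command):
    """One iteration of the control loop: a window of brain-signal
    samples in, a robot motion command out."""
    command = classify(process(samples))
    send_command(command)  # e.g. write to the robot's serial port
    return command
```

Passing `send_command` as a callback keeps the signal-processing logic independent of the robot hardware, which is handy for testing the loop without a robot attached.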
@Inbook{Tripathy2015, author="Tripathy, Devashree and Raheja, Jagdish Lal", editor="Satapathy, Suresh Chandra and Biswal, Bhabendra Narayan and Udgata, Siba K. and Mandal, J. K.", title="Design and Implementation of Brain Computer Interface Based Robot Motion Control", bookTitle="Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014: Volume 2", year="2015", publisher="Springer International Publishing", address="Cham", pages="289--296", abstract="In this paper, a Brain Computer Interactive (BCI) robot motion control system for patients' assistance is designed and implemented. The proposed system acquires data from the patient's brain through a group of sensors using Emotiv Epoc neuroheadset. The acquired signal is processed. From the processed data the BCI system determines the patient's requirements and accordingly issues commands (output signals). The processed data is translated into action using the robot as per the patient's requirement. A Graphics user interface (GUI) is developed by us for the purpose of controlling the motion of the Robot. Our proposed system is quite helpful for persons with severe disabilities and is designed to help persons suffering from spinal cord injuries/ paralytic attacks. It is also helpful to all those who can't move physically and find difficulties in expressing their needs verbally.", isbn="978-3-319-12012-6", doi="10.1007/978-3-319-12012-6_32", url="https://doi.org/10.1007/978-3-319-12012-6_32" }

Academic Professional Service


Awards/Honors

2021
Student Travel Grant for NAS 2021.
2020
Grace Hopper Celebration 2020 Scholar.
2019
Student Travel Grant for ISCA 2019 and HPDC 2019.
2018
Student Travel Grant for ISCA 2018.
2018
Student Travel Grant for 2018 CRA-W Grad Cohort.
2017
Student Travel Grant for MICRO 2017.
2017
Student Travel Grant for Third Career Workshop for Women and Minorities in Computer Architecture.
2016
Student Travel Grant for NAS 2016.
2015
Dean’s Distinguished Fellowship, Bourns College of Engineering, University of California, Riverside.
2012-2014
Quick-Hire Fellowship by the Government of India.
2012
Ranked first among all undergraduate students of VSSUT Burla.
2009
Golden Jubilee Meritorious Girls Scholarship, VSSUT Burla. (Awarded for being ranked first among 300+ students across all disciplines of the college of engineering in the freshman year.)
