Have you ever wondered whether the order of fields within a Go struct affects memory consumption or application performance? I did, and it turns out that it might indeed have an impact. The reason? Memory alignment requirements in modern CPU architectures. I documented my understanding of the topic at https://lnkd.in/ddDsG8MU . Take a look if you're interested; feedback is appreciated!
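Alignment padding is easy to see even without a Go toolchain: Python's ctypes lays out structures the way a C compiler would, which mirrors Go's rules for these field types on typical 64-bit platforms. A minimal sketch (the struct names and field mix are my own example):

import ctypes

class BadOrder(ctypes.Structure):
    # int8 then int64: 7 padding bytes are inserted before b so it lands on
    # an 8-byte boundary, and 7 more after c so the size is a multiple of 8.
    _fields_ = [("a", ctypes.c_int8),
                ("b", ctypes.c_int64),
                ("c", ctypes.c_int8)]

class GoodOrder(ctypes.Structure):
    # Largest field first: only 6 bytes of tail padding are needed.
    _fields_ = [("b", ctypes.c_int64),
                ("a", ctypes.c_int8),
                ("c", ctypes.c_int8)]

print(ctypes.sizeof(BadOrder))   # 24 on a typical 64-bit machine
print(ctypes.sizeof(GoodOrder))  # 16 on the same machine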
-
One of the things I work on at Rockset is our CPU profiling infrastructure. We've developed a cool technique to precisely correlate perf samples with application data, such as query IDs. If you profile requests in a concurrent system, it might be useful for you too: https://lnkd.in/dP2DuFRY
Profiling Individual Queries in a Concurrent System
rockset.com
-
Definitely worth a read for anyone interested in understanding the performance of concurrent systems: the Rockset post shared above, "Profiling Individual Queries in a Concurrent System" (rockset.com).
-
Here is a sneak peek section from my new book: I updated the CRC/checksum speed comparison to include a new dual-sum checksum variant and some novel speedup techniques. You can get twice the data word length at HD=3, and also twice the speed compared to a Fletcher/Adler checksum, by using a DualX checksum. Or use the DualXP variant to get HD=4 at about the same data word length at which a Fletcher/Adler gets HD=3. (Explained in the blog, with a pointer to source code on the book's support site.) Still longer HD=3/HD=4 data word lengths are available using a Koopman checksum. And a CRC is still there as a further tradeoff point for length vs. speed. (Speeds depend on the CPU you're using; these are for 32-bit checksums on a 32-bit desktop CPU.) #embedded #crc #checksum https://lnkd.in/e77HZQWn
Comparative speeds for different Checksum & CRC implementations
checksumcrc.blogspot.com
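For orientation, DualX/DualXP are the book's own variants, but the classic dual-sum structure they improve on is easy to sketch. Here is a minimal, unoptimized Fletcher-32 in Python (two running sums over 16-bit words; this is not the book's algorithm and includes none of its speedups):

def fletcher32(data: bytes) -> int:
    # sum1 accumulates the data words; sum2 accumulates sum1, which makes
    # the checksum sensitive to word order, not just word values (this is
    # what gives Fletcher checksums HD=3 up to a length limit).
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)  # little-endian word assembly
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1

print(hex(fletcher32(b"abcdef")))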
-
In the world of computer science and software development, understanding CPU microarchitecture is like having a magic key to optimize your code. This complex topic delves into the inner workings of your computer's processor, and it's the cornerstone for crafting software that runs faster and smoother. https://lnkd.in/e28B_NJq is a great starting point for anyone interested in learning about pipelining, branch prediction, and data dependencies, with easy-to-follow visualizations and explanations.
Architecture All Access: Modern CPU Architecture 2 - Microarchitecture Deep Dive | Intel Technology
https://www.youtube.com/
-
The CPU flame graph 🔥 is a visualization tool that unravels the mystery of CPU time consumption within your code. How? By aggregating function call stack traces, so that the width of each stack frame tells you how much CPU time it's responsible for. This lets you spot performance bottlenecks and areas crying out 😭 for optimization. There are many tools to generate flame graphs, but the most popular ones are:
- https://lnkd.in/ee9hVrBc
- https://lnkd.in/e7Yw2Tk4
Their READMEs cover installation and usage, so instead let's go over how to read a flame graph:
* Colors are used to differentiate functions, but the specific color doesn't hold significance.
* The y-axis is your function call stack: in the classic layout the root is at the base and the leaves are at the top, and each box symbolizes a function call sitting directly on top of the frame that called it.
* A box's width measures the CPU time spent in that function and its children; a tall stack of boxes means those functions are riding along inside their parents' calls.
* To spot a bottleneck, look for wide boxes, and pay special attention to wide frames near the top of a stack: that's where the CPU time is actually being spent. (A minimal py-spy demo follows below.)
CPU flame graphs are like a guide through the maze of code performance, and it's worth learning how to generate and read them to make your code work better (blazing faaaaaast 🚀).
#PythonPerformance #CodeOptimization #FlameGraphs #PySpy #VisualizePerformance #TechTools #PerformanceInsights #CodeEfficiency #SoftwareDevelopment #softwareengineering #performanceoptimization #performanceengineering #OptimizeYourCode
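The post's shortened links aren't expanded here, but given the #PySpy hashtag, here is a hedged sketch of producing a flame graph with py-spy: a toy script with a deliberately hot function, plus the recording commands in comments (the file name toy.py is my assumption):

# Record a flame graph of this script with py-spy (pip install py-spy):
#   py-spy record -o profile.svg -- python toy.py
# or attach to an already-running process:
#   py-spy record -o profile.svg --pid <PID>
import math

def cold_setup():
    # Cheap: shows up as a narrow sliver in the flame graph.
    return [i * i for i in range(1_000)]

def hot_loop():
    # Deliberately expensive: dominates the graph as a wide box.
    return sum(math.sqrt(i) for i in range(10_000_000))

def main():
    cold_setup()
    hot_loop()

if __name__ == "__main__":
    main()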
-
Spark Optimizations (Resource-Level Optimization)
As discussed in the last post, there are two kinds of optimization; let's discuss resource-level optimization.
For a job to run efficiently, the right amount of resources should be allocated. Resources include:
- Memory (RAM)
- CPU cores (compute)
Let's think about an example: a cluster of 10 worker nodes, where each machine has 64 GB of RAM and 16 CPU cores. Let's determine how many executors a node should host to process data efficiently.
Strategies for creating containers:
1. Thin executors: more executors, each holding minimal resources (16 executors per node, each with 1 CPU core and 4 GB of RAM).
Disadvantages:
1. No multithreading within an executor (a single core cannot multitask).
2. Shared (broadcast) variables are replicated to every executor, so 16 executors mean 16 copies.
2. Fat executors: give maximum resources to each executor (one executor per node, with 16 CPU cores and 64 GB of RAM).
Disadvantages:
1. HDFS throughput suffers (the HDFS client degrades beyond roughly 5 concurrent cores per executor).
2. Such a large heap takes a lot of time to garbage-collect.
A well-balanced method for building containers:
We have 16 cores and 64 GB RAM per node. 1 core goes to background activity and 1 GB of RAM is allocated to the operating system, leaving 15 cores and 63 GB RAM. So 5 cores and 21 GB RAM per executor (3 executors per node) is considered ideal, and we get:
1. Multithreading within each executor.
2. HDFS throughput that doesn't suffer.
A config sketch follows below.
#pyspark #apachespark #bigdata
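To make this concrete, here is a minimal sketch (mine, not from the post) of the balanced sizing expressed as Spark configuration for the 10-node example. Reserving one executor slot for the driver/ApplicationMaster and carving the 21 GB into heap plus memoryOverhead are my assumptions:

from pyspark.sql import SparkSession

# Cluster: 10 nodes x 16 cores x 64 GB; per node, 15 usable cores and 63 GB
# -> 3 executors x 5 cores x 21 GB each. 10 nodes x 3 = 30 executors total,
# minus one slot kept free for the YARN ApplicationMaster / driver.
spark = (
    SparkSession.builder
    .appName("balanced-executors")                  # hypothetical app name
    .config("spark.executor.instances", "29")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "19g")         # heap portion of the 21 GB
    .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom
    .getOrCreate()
)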
-
Want end-to-end scripts for FINDING WHAT'S CAUSING HIGH CPU? (Helpful for day-to-day issues and your upcoming interview.) Here you go:

1. Identify currently running queries consuming CPU:

SELECT r.session_id,
       r.status,
       r.wait_type,
       r.cpu_time,
       r.total_elapsed_time,
       r.start_time,
       s.text AS [Query Text]
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) s
ORDER BY r.cpu_time DESC;

2. Identify top CPU consumers by query:

SELECT TOP 10
       total_worker_time / execution_count   AS [Avg CPU Time],
       execution_count,
       total_elapsed_time / execution_count  AS [Avg Elapsed Time],
       total_logical_reads / execution_count AS [Avg Logical Reads],
       total_physical_reads / execution_count AS [Avg Physical Reads],
       total_logical_writes / execution_count AS [Avg Logical Writes],
       (SELECT text FROM sys.dm_exec_sql_text(sql_handle)) AS [Query Text]
FROM sys.dm_exec_query_stats
WHERE total_worker_time > 0
ORDER BY [Avg CPU Time] DESC;

3. Check CPU-related wait types:

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       max_wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'CXPACKET%'
   OR wait_type = 'SOS_SCHEDULER_YIELD'
   OR wait_type = 'THREADPOOL'
   OR wait_type = 'RESOURCE_SEMAPHORE'
ORDER BY wait_time_ms DESC;

Briefly, what these wait types mean (they represent scenarios where SQL Server tasks are waiting for resources, including CPU):
- CXPACKET: parallel query execution waits. High values may suggest that parallelism is causing CPU contention.
- SOS_SCHEDULER_YIELD: a task voluntarily yielded the scheduler after using up its quantum so that other tasks could execute. High values may indicate CPU pressure.
- THREADPOOL: SQL Server is waiting for a worker thread to become available. High values may suggest there are insufficient worker threads to handle the workload.
- RESOURCE_SEMAPHORE: SQL Server is waiting for memory grants. High values may indicate memory pressure that spills over into CPU contention.
-
Nano Tips (12):
--------------
Internal fragmentation (empty space on a page) can make queries take longer and consume more resources (cache, CPU, and IO), because the same rows are spread across more pages. A measurement sketch follows below.
#sql #performancetuning #databasedesign
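A hedged way to quantify that fragmentation from Python (the connection DSN is hypothetical; the DMV and its columns are standard SQL Server):

import pyodbc  # pip install pyodbc; needs an ODBC driver for SQL Server

# Low avg_page_space_used_in_percent means pages are mostly empty, so scans
# read more pages than necessary (more cache, CPU, and IO for the same rows).
QUERY = """
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_page_space_used_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
JOIN sys.indexes i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
ORDER BY ips.avg_page_space_used_in_percent ASC;
"""

conn = pyodbc.connect("DSN=MyServer;Trusted_Connection=yes")  # hypothetical DSN
for row in conn.cursor().execute(QUERY):
    print(row.table_name, row.index_name,
          round(row.avg_page_space_used_in_percent, 1), row.page_count)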
-
A quick way to monitor processes for CPU and memory utilization, irrespective of the operating system: https://lnkd.in/d2aZifyW
Using the script
sbytestream.pythonanywhere.com
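The linked script isn't reproduced here, but here's a minimal cross-platform sketch of the same idea using psutil (the monitor() helper, the 1-second interval, and the example PID are my own choices):

import psutil  # pip install psutil; works on Linux, macOS, and Windows

def monitor(pid: int, interval: float = 1.0) -> None:
    # Poll one process and print its CPU and memory usage each interval.
    proc = psutil.Process(pid)
    while True:
        cpu = proc.cpu_percent(interval=interval)   # % of one core, sampled
        rss = proc.memory_info().rss / (1024 ** 2)  # resident memory in MB
        print(f"{proc.name()} (pid {pid}): cpu={cpu:.1f}% rss={rss:.1f} MB")

# monitor(12345)  # replace 12345 with a real process ID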
-
Let's try to understand Parallel Execution and Job Creation in a Cluster Environment with an example.
Cluster Configuration:
- Cluster composition: 3 worker nodes
- Executor specifications: each node hosts 3 executors, each configured with 3 CPU cores and 1 GB of memory.
File Execution Details:
- File size: 4 GB
- Partition size: default setting of 128 MB
Calculations (an arithmetic sketch follows below):
1. Number of partitions:
- Partition size = min(maxPartitionBytes, file size / default parallelism)
- File size / partition size = 4 GB / 128 MB = 32 partitions
2. Total number of tasks to be executed:
- Each partition corresponds to a task, so total tasks = number of partitions = 32 tasks
3. Number of tasks running in parallel:
- Given the configuration, the maximum parallelism is: number of nodes * number of executors per node * number of CPU cores per executor = 3 * 3 * 3 = 27 tasks running in parallel
4. Number of jobs:
- Each action applied constitutes a job, so the number of jobs depends on the number of actions applied.
5. Number of stages:
- Stages are bounded by shuffles: the number of wide (shuffle) transformations plus one gives the number of stages in the job execution.
Tagging Sumit Mittal for review.
#ApacheSpark #ClusterComputing #BigData #DataEngineering #SparkProgramming
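A minimal sketch (mine) reproducing the arithmetic above; the notion of scheduling "waves" is my addition:

# 4 GB file split into 128 MB partitions -> tasks
file_size_mb = 4 * 1024
partition_mb = 128
partitions = file_size_mb // partition_mb  # 32 partitions = 32 tasks

# 3 nodes x 3 executors x 3 cores -> task slots available at once
nodes, executors_per_node, cores_per_executor = 3, 3, 3
parallel_tasks = nodes * executors_per_node * cores_per_executor  # 27

# 32 tasks over 27 slots -> 2 scheduling waves (27 tasks, then the last 5)
waves = -(-partitions // parallel_tasks)  # ceiling division
print(partitions, parallel_tasks, waves)  # 32 27 2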