
Effective Performance Engineering at Twitter


Transcript

Yue: I’m going to talk to you about a type of platform and maybe not the most common kind you run into. This is a case study of how we tried to solve a problem but ended up accidentally almost doing platform engineering. This is about my tenure at Twitter. I was at Twitter for almost 12 years, so this is mostly the second half of it. Now I’m no longer at Twitter.

Debugging a Generator (Charles Steinmetz)

I want to start with a story. I grew up in China, so I first read this story in Chinese. There was this wizard of an electrical engineer and scientist, Charles Steinmetz. He was originally German, and after he came to the U.S. he became a wizard figure in his local town. One of the most famous anecdotes about him is that he was asked by Henry Ford to debug a problem with a generator. He said, I don't need tools, give me some pen and paper. He just sat there and thought. On the second night, he asked for a ladder, climbed up the generator, made a mark on the side, and said, "Replace the plate here, and also 16 turns of coil." Ford's engineers did as he said, and the generator performed to perfection. Later, he sent Ford a bill for $10,000. I looked it up: $10,000 back then is about $200,000 in today's purchasing power. For two days' worth of work, that's not so bad. Ford was a little taken aback and asked, why are you charging so much money? This was his explanation: making the chalk mark, $1; knowing where to make the chalk mark, the rest of it. Ever since then, whenever I see performance engineering stories, usually on Hacker News, this is how it usually goes: here is a core loop, we changed one line in this loop, and suddenly our performance went up by 300%. Then it goes straight to the front page, and everybody has a great discussion about what this strange AVX vectorized instruction is, or how you fit things into L1 or L2 cache. Everybody is like, if only I knew where to find those one-line changes, I could be Hacker News famous and make a lot of impact. That tends to be the popular conception of what performance engineering is: you know exactly where to look, you make that magical touch, and things automatically get better. Is that true? Should we basically hire a room of wizards, set them loose, and let them go around holding chalk and marking all over the place?

Background

I was actually at QCon before, speaking at QCon SF in 2016. Back then I was talking about caching. I spent my first half at Twitter mostly managing and developing caches. I weathered through a dozen user-facing incidents, and I spent lots of time digging in the trenches. Shortly after giving that talk, I founded the performance team, called IOP, at Twitter in 2017. That went on until last year. Generally, I like systems, especially distributed systems, and I like software and hardware integration. Performance really is an area that allows me to play with all of them. That is what I have spent my time working on.

Why We Need Performance Engineering (More than Ever)

I want to talk to you a little bit about performance engineering. If you want to start pivoting into performance engineering, this is the year to do it. How many companies here have been given some cost-effectiveness or efficiency mandate? This is one reason to care about performance. If that's on your mind, here's what you can do. Number one, I want to first say a little bit about why we actually need performance engineering now more than ever. There was a wonderful talk by Emery Berger, who is a professor at the University of Massachusetts. In his talk, called "Performance Matters," he pointed out that performance engineering used to be easy because computers got faster about every 18 months. If you're a performance engineer, you just sit on the beach, drink your cocktail, and buy the next batch of machines in 18 months. I'm like, that's wrong. If it's that easy, then you don't need it. In fact, what happens is people don't really hire performance engineers until they're extremely large companies, and the rest of the engineering department just sat on the beach, drank their cocktails, waited 18 months, and then bought the next generation of computers. Now the moment has finally come where you actually have to put in work. Why is that? This is a slide about all kinds of hardware features. It's intentionally made busy, starting from things like NUMA nodes and PCIe accelerators like GPUs and other devices, which have been on the market for at least 15 years. There are all these terms that represent certain kinds of specialized technologies that you may or may not have heard of. There's CXL, there's vectorization, programmable NICs, RDMA, whatever. The point is, there's a lot going on in hardware engineering, because we have all these pesky laws of physics getting in the way. There's the power cap, there's the thermal cap. You cannot just make a computer run faster; simply running faster would be best, but instead you do all these tricks. The heterogeneous evolution of the hardware means it's getting really difficult to write a good program, or the right kind of program, to actually take advantage of it. If you think programming a many-core computer is difficult, just wait until you have to learn 20 different new technologies to write a simple program. It's really getting out of control.

On the other hand, we software engineers are not doing ourselves any favors. If you look at a modern application, it's highly complex. On the left is a joke from XKCD, but it's very much true: you have this impossibly tall stack, and nobody knows what their dependency tree really is, and nobody really wants to know. If you zoom out and say, let me look at my production, your production is not even one of those tall stacks, your production is a bunch of those tall stacks. No matter how complex a graph you draw to represent the relationships between the services, reality is usually much worse. On the bottom right is a graph that is actually quite old by now; it's a 5-year-old graph from when Twitter was simpler. This is the call graph between all services. You can see the number of edges and how they connect with each other. It's an enormous headache. The problem here is that things are complex. If Steinmetz were alive today and had to solve the software equivalent of that generator, I think what he would be facing is 6-foot-tall generators packed into a 40-foot-long corridor, and then an entire warehouse of them. That's not the end: you have three such warehouses connected to each other, and to generate any electricity at all, you have to power up all three, and they are constantly bustling. When you have that level of complexity, there is basically no room for magic, because who needs magic when you have that much systemic complexity instead? The answer is no: we cannot do performance engineering purely by relying on knowing everything and being able to hold the state in our heads, because reality is far more challenging than that.

The answer, basically, is treating complexity with the tools that are designed to handle complexity. One of the languages for complexity is systems. This is something that has been mentioned over and over again. I think systems are not mythical; a system essentially is a lot of things, and those things are connected. Systems thinking essentially is building a model to describe the relationships between the different parts. If you see the fundamental reality through that lens, then what is performance? Performance is merely a counting exercise on top of those relationships. You're counting how many CPUs, how much memory bandwidth, or how many disks you have. You are counting how often they are used, and to what extent they are used. The key here is you need to have the system model in place, so you have the basic structure. Then we need to count these resources at the right granularity. If you care about request latency, for example, and request latency tends to play out on the level of milliseconds, then you need to count at those time granularities. Otherwise, the utilization would not reflect what the request is experiencing. All of these things will change over time, and therefore this exercise has to go on continuously.
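
To make the granularity point concrete, here is a minimal sketch (an illustration of mine, not anything from the talk): a 50-millisecond CPU burst is essentially invisible in a one-minute average but obvious to a 10-millisecond sampler. All numbers are made up.

```python
"""Illustrative only: why counting at the right granularity matters.

Simulate per-millisecond CPU utilization with a 50 ms burst at 100%
on top of a 10% baseline, then compare what a 60-second average
reports versus what 10 ms sampling windows see."""

BASELINE = 0.10          # 10% background CPU utilization
BURST_START_MS = 30_000  # burst begins 30 s into the minute
BURST_LEN_MS = 50        # a 50 ms spike, the kind a request actually feels

# One utilization value per millisecond over a full minute.
per_ms = [
    1.0 if BURST_START_MS <= t < BURST_START_MS + BURST_LEN_MS else BASELINE
    for t in range(60_000)
]

# Coarse view: one sample per minute, as many observability setups report.
minute_avg = sum(per_ms) / len(per_ms)

# Fine view: 10 ms windows, as a high-frequency sampler would see them.
window = 10
fine = [sum(per_ms[i:i + window]) / window for i in range(0, len(per_ms), window)]

print(f"1-minute average: {minute_avg:.3f}")   # ~0.101, looks idle
print(f"max 10 ms window: {max(fine):.3f}")    # 1.000, the spike that hurt the request
```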

Performance as a system property allows us to draw some parallels with other system-level properties. I think those are fairly well understood. For example, there's security. The intuitive way of thinking about security is that the weakest link determines the overall system security. It's an AND relationship: anything that fails means the whole system fails. Availability, on the other hand, is much more forgiving: if any part of the system works, the system is available. Performance is somewhere in the middle. Performance loosely can be seen as the sum of all parts. You add up the cost of your parts, or you add up the latency of all your steps. This means that when we approach performance, we often can trade off. We can say a dollar saved is a dollar saved. It doesn't matter if it's saved here or there, because in the end, it all adds up. Performance also links to higher-level properties, or business-level concepts, that we care about. Businesses generally care about reliability; they want their service to be available. Performance is a multiplier of that: better performance means you can handle more with the same resources. Cost, on the other hand, is inversely related to performance: if performance is lower, then you have to spend more money to get the same kind of availability. Through these links, performance ties itself to the overall business objectives.
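
One rough way to write that intuition down (my formalization, not the speaker's) is to compare how the three properties compose across components i:

```latex
% Illustrative formalization of how the three properties compose
% across components i (my notation, not from the talk).
\[
  S_{\text{system}} = \min_i S_i
  \quad\text{(security: the weakest link decides)}
\]
\[
  A_{\text{system}} = 1 - \prod_i \bigl(1 - A_i\bigr)
  \quad\text{(availability: any working replica keeps you up)}
\]
\[
  C_{\text{system}} \approx \sum_i C_i
  \quad\text{(performance and cost: the parts roughly add up)}
\]
```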

Performance engineering implies that things can be made better; in particular, performance can be made better. Can we make that claim? I really like the book "Understanding Software Dynamics." What it establishes in the first chapter is that there is a limit, a baseline for how long things should take. For example, you can never violate the laws of physics. You cannot run faster than what the CPU can do according to its clock. If there's a delay because of PCIe, you cannot transfer data any faster than the physical limitation of those links. All of these boundaries and constraints tell us what you can expect if you do everything right. The other aspect is predictability. If you keep measuring things over and over, over time you get a distribution. You can see how bad things are at the tail, how bad things are in the worst cases. Those tend to have an effect when you have a very large infrastructure. On the other hand, you can understand the consistency: if you measure over time, this is the behavior you can characterize in aggregate. The TL;DR is this: if you design something really well for performance, good performance engineers and good software engineers tend to converge on similar designs. If you know what good performance looks like, you can measure what you actually get. If what you actually get is different from what is good, that delta is the room for optimization. Because such a limit exists, and because we know how to measure against it, we can do optimization. Another thing I want to mention is that there is a structure to performance engineering. Just like building a house, the most important thing is to get the structure right. After that you want to do the plumbing, because once you put the drywall on, you cannot go back and change the plumbing very easily. There's an analogy here: when you design a program, how you communicate between different threads and different parts of your program is really important, because those things are hard to change. The kinds of ideas we associate with performance engineering anecdotes, like changing an inner loop, are actually the least important, because those are the easiest to change. You can always change one line of code very quickly, but you cannot change the structure of a problem. When you do performance engineering, focus on the things that cannot be changed easily or quickly. Then, eventually, you get to those local optimizations.
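
As a back-of-envelope illustration of the "limit versus measurement" idea, here is a tiny sketch with assumed numbers (the 3 GiB/s device bandwidth and the 2-second measured time are hypothetical, not from the talk):

```python
# Back-of-envelope sketch: compare a physics-derived lower bound against a
# measured time; the gap is the optimization headroom described above.
# All figures are assumptions chosen for illustration.

bytes_to_read = 1 * 1024**3        # 1 GiB working set
device_bw = 3.0 * 1024**3          # assumed ~3 GiB/s sequential device bandwidth

lower_bound_s = bytes_to_read / device_bw   # ~0.33 s: cannot beat the hardware
measured_s = 2.0                            # what the job actually took (hypothetical)

headroom_s = measured_s - lower_bound_s
print(f"bound={lower_bound_s:.2f}s  measured={measured_s:.2f}s  headroom={headroom_s:.2f}s")
```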

Performance Engineering at Scale

Now we know how to do performance engineering, at least in concept. This talk is about how to do performance engineering at Twitter scale. How do we do it at scale? We basically follow the blueprint: we want to build a model of the system, and we want to do counting on that system. What actually happens when we build that model and do the counting at scale is that we translate what would otherwise be reliability work, systems work, or a magical wizard of an engineer looking into an isolated piece of code, into a data engineering problem. Let's use this as an example. Think about modern software as a local runtime. Software engineers are very good at abstraction. We have all these layers: our application code, which sits on top of a bunch of libraries; it may have a runtime like the JVM; then we have the operating system; underneath it we have the hardware and then even the network. All these layers of abstraction provide the structure for how we think about software in terms of hierarchy. Once we have this mental model in mind, we can collect the metrics, because in the end, what we care about is how resources are used. You can count these things: at the bottom you have your hardware resources, but they get packaged into all these kernel-level syscalls and other low-level functionalities. Then you can count those to get a higher-level unit that you can tally, and so on and so forth, until you get all the way to the top of the application.
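
A toy illustration of that bottom-up counting, with invented syscall counts and sizes, just to show the roll-up from low-level events to per-request and per-service figures:

```python
# Toy roll-up (illustrative only): count low-level resource usage and
# attribute it upward to higher-level units of work. All numbers and
# names below are made up.

per_request_syscalls = {"read": 12, "write": 4, "epoll_wait": 3}
bytes_per_syscall = {"read": 16 * 1024, "write": 4 * 1024, "epoll_wait": 0}

# Roll syscall-level counts up into a per-request I/O figure.
bytes_per_request = sum(
    count * bytes_per_syscall[name] for name, count in per_request_syscalls.items()
)

# Roll the per-request figure up into a service-level figure.
requests_per_second = 5_000
service_mb_per_s = bytes_per_request * requests_per_second / 1e6

print(f"~{bytes_per_request} bytes of I/O per request")
print(f"~{service_mb_per_s:.1f} MB/s of I/O for the service")
```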

What we did is, we essentially said, these are the data we need, and these are the data that would apply universally. Now we're essentially in the domain of data engineering. In data engineering, what you care about are two things. One is where you get the data and what data you're getting. Number two is, what are you doing with that data, and how are you treating it? That's exactly what we did. When it comes to data generation, or signal generation, remember that performance is about counting resources at the right granularity. One thing that is unique about performance is that the granularity it looks at is generally much finer than in general observability. If you get metrics once every minute, or once every 10 seconds, that's considered a fairly ok interval. A lot of the things about performance happen on the level of microseconds or milliseconds. Often, you need to collect signals at those levels, so that if you have a spike that lasts only 50 milliseconds, you'll still be able to see it. That requires samplers that give us all the low-level telemetry at very high frequency and very low overhead, so we wrote our own. We heavily used eBPF, which has low overhead and also lets us look into the guts of the layers of abstraction and get pretty much any visibility we want. We also have a project called long-term metrics, which synthesizes pretty much all levels of metrics. You can correlate request-response latency with very low-level resource utilization. You can see: is this request slowed down due to a spike in CPU usage, or something else?
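
A minimal user-space sketch of the high-frequency sampling idea (this is not Rezolus, which is a far more capable agent; it only polls /proc/stat, whose tick resolution is coarse, and it is Linux-only with error handling omitted):

```python
"""Sketch of a high-frequency sampler: poll /proc/stat every 10 ms and keep
a histogram of per-interval CPU busyness, so short spikes survive
aggregation. Illustrative only; a real sampler would use finer counters."""

import time
from collections import Counter

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]   # aggregate "cpu" line
    vals = list(map(int, fields))
    idle = vals[3] + vals[4]                # idle + iowait
    return sum(vals), idle

def sample(interval_s=0.01, duration_s=5.0):
    hist = Counter()
    total_prev, idle_prev = cpu_times()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        time.sleep(interval_s)
        total, idle = cpu_times()
        dt, di = total - total_prev, idle - idle_prev
        busy = 1.0 - (di / dt) if dt else 0.0
        hist[round(busy, 1)] += 1           # bucket into 10%-wide bins
        total_prev, idle_prev = total, idle
    return hist

if __name__ == "__main__":
    for bucket, count in sorted(sample().items()):
        print(f"busy~{bucket:0.1f}: {count} intervals")
```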

We are a performance team; we're not explicitly a platform team. The point of building all this data infrastructure is to solve real problems. One of the ways we use our data is that we looked at everybody's GC, and we concluded what the right distribution of GC intervals is and how often GC should be running. Then we generated a set of instructions and a UI to tell people how to tune their GC automatically. It computes this for everybody, and you can see the result. We also have fleet health reports showing how many bad hosts are becoming outliers and slowing everything down. We also have utilization reports telling people where their utilization is, where it should be, and how much room they have to improve. One of the unintended usages is that people started using our datasets for things that may or may not have to do with performance engineering. Capacity engineers decided they wanted to validate their numbers using the data we have. Service owners saw what we do when it comes to optimizing certain services, and they were like, we know how to do this, it's a simple query, we can copy yours. They started doing it on their own and actually saved quite a bit in some cases. Just because we have the data in a queryable, SQL-ready state, several teams started to prefer the data we produce, because it's simply easier to use. That is the localized side of data engineering for performance. We took a bit of a shortcut here: as you can see, we did not really work out the mapping between these layers rigorously. That's because Twitter applications tend to be very homogeneous. We have libraries like Finagle and Finatra, which allow us to cut corners, because all applications look alike. If you want to understand a heterogeneous set of services, you may need to do things like tracing and profiling, so the relationships between the different layers become more obvious.
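
As a flavor of what "it's a simple query, we can copy yours" might look like, here is a hedged sketch; the table, columns, thresholds, and client below are invented for illustration and are not Twitter's actual schema or tooling.

```python
# Hypothetical example (invented table and column names): the kind of ad-hoc
# query a service owner could copy to find over-provisioned services from a
# curated, SQL-ready utilization dataset.

FIND_OVERPROVISIONED = """
SELECT service, role, p99_cpu_util, reserved_cores
FROM fleet_utilization_daily          -- hypothetical curated table
WHERE ds = '2022-10-01'
  AND p99_cpu_util < 0.30             -- p99 CPU below 30% of the reservation
ORDER BY reserved_cores DESC
LIMIT 50
"""

# With the dataset in a SQL-queryable store, running this is one call, e.g.:
# rows = sql_client.execute(FIND_OVERPROVISIONED)   # sql_client is an assumption
```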

Then we zoom out. We have all these local instances of applications, but they don't tell the end-to-end story. When you have a very complex application with relationships between services, we also need to understand how those services talk to each other. This is often the domain of tracing, or just distributed systems understanding. Here we take the same approach. We ask, what is the system here? The system here is that you have all these services, and they talk to each other. When they call each other, that's when they are connected. We capture these edges, and then, same thing, we do counting on top of that. Counting here comes in very different flavors. You can do counting with time: how much time is spent on this edge? You can do counting on resources: how many bytes, or what kinds of attributes are propagated. Anything you can think of can be applied to the edge. The most important thing is the structure, which is this tree-like graph on top of which all the interactions happen. We did a very similar data treatment. First, we said, let's improve the signals we gather. Twitter started the Zipkin project, which later became the standard in OpenTelemetry. All of Twitter's services came with tracing. Tracing data is particularly fraught with all kinds of data problems. Some of them are inevitable, like clock drift. Some of them are just due to the fact that everybody can do whatever they want, so you end up with issues like a missing field or garbage data in a field. All of those require careful validation and data quality engineering to iron out. It took numerous attempts to get to a state we can trust. On top of that, we could then build a very powerful trace aggregation pipeline. What it does is collect all the traces, instead of looking at them one by one, because nobody can ever enumerate them. You start putting them together and building indices. Indices really are the superpower when it comes to databases. We have indices of the traces themselves. We have indices of the edges. We have indices of the mapping between the trees and the edges. All of this data is queryable by SQL.
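
A toy version of the edge aggregation described above (illustrative only, not Twitter's pipeline): collapse individual spans into an indexed set of caller-to-callee edges with call counts and latency totals.

```python
"""Toy trace aggregation: turn individual spans into an index of
caller->callee edges with counts and latency sums. Service names and
durations are made up for illustration."""

from collections import defaultdict

# Each span: (trace_id, span_id, parent_id, service, duration_ms)
spans = [
    ("t1", "a", None, "home-timeline",  120.0),
    ("t1", "b", "a",  "timeline-mixer",  80.0),
    ("t1", "c", "b",  "tweet-service",   30.0),
    ("t1", "d", "b",  "tweet-service",   25.0),
]

# Map (trace_id, span_id) -> service, so a child can find its caller.
by_id = {(t, s): svc for t, s, _, svc, _ in spans}

edges = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
for trace_id, span_id, parent_id, service, dur in spans:
    if parent_id is None:
        continue                               # root span has no caller edge
    caller = by_id[(trace_id, parent_id)]
    edge = edges[(caller, service)]
    edge["calls"] += 1
    edge["total_ms"] += dur

for (caller, callee), stats in edges.items():
    mean = stats["total_ms"] / stats["calls"]
    print(f"{caller} -> {callee}: {stats['calls']} calls, mean {mean:.1f} ms")
```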

What we were able to do is answer things like: how do services depend on each other? This is a screenshot of what we call a service dependency explorer. It tells you who calls whom to execute a particular high-level request: if you're visiting the Twitter home timeline, how many services are called along the way, and for each call at the highest level, how many downstream calls are necessary? You instantly get an idea of both the connectivity and the load amplification between services. We also built a model, which turned into a paper called LatenSeer, that essentially allows you to do causal reasoning about who is responsible for the latency you're seeing at the high level. The latency propagation, or the latency critical path, is what it produces. What this allows you to do is ask, what if? What if I migrate this service to a different data center? What if I make it 10% faster, does it make a difference overall? All these properties can be studied. We did lots of analysis on demand, because aggregate tracing often can answer questions that no other dataset can give you answers for. One unintended usage of this dataset is that the data privacy team wanted to know what sensitive information is in the system and who has access to it. They realized the only thing they were missing was how services connect to each other. The dataset we built mostly to understand how performance propagates through the system ended up being the perfect common ground. They threw away all our accounting and replaced it with the property they care about, but they kept the structure of the system in place. This allowed them to do their privacy engineering with very little upfront effort.
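
In the spirit of the what-if analysis described above, here is a toy sketch (mine, not LatenSeer): model a request as a tree where a parent waits on its slowest child, then see how speeding up one service moves the end-to-end latency only if that service is on the critical path. Service names and numbers are made up.

```python
"""Toy what-if latency analysis (illustrative only): a parent's latency is
its own work plus the slowest of the children it waits on, assuming
children run in parallel."""

TREE = {
    "home-timeline":  {"self_ms": 10, "children": ["timeline-mixer"]},
    "timeline-mixer": {"self_ms": 15, "children": ["tweet-service", "user-service"]},
    "tweet-service":  {"self_ms": 40, "children": []},
    "user-service":   {"self_ms": 25, "children": []},
}

def end_to_end(tree, node, scale=None):
    n = tree[node]
    own = n["self_ms"] * (scale.get(node, 1.0) if scale else 1.0)
    kids = [end_to_end(tree, c, scale) for c in n["children"]]
    return own + (max(kids) if kids else 0.0)   # wait for the slowest child

baseline = end_to_end(TREE, "home-timeline")
on_path  = end_to_end(TREE, "home-timeline", scale={"tweet-service": 0.9})
off_path = end_to_end(TREE, "home-timeline", scale={"user-service": 0.9})

print(f"baseline: {baseline:.0f} ms")                       # 65 ms
print(f"tweet-service 10% faster: {on_path:.0f} ms")         # 61 ms, on the critical path
print(f"user-service 10% faster:  {off_path:.0f} ms")        # 65 ms, off the critical path
```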

In summary, we basically spent the majority of our time doing data engineering, which is certainly not what I was thinking about when I started, but it made our actual work much easier. We were the number one and primary dogfooding customer of our own platform. We were able to do all of these. Each thing I list here at the top is probably on the order of tens of millions of dollars, no big deal. We were able to finish each task in a few months. All of that was possible because we were using the data that we had carefully curated. Any question we had was no more than a few queries away. This, essentially, is the power of platform engineering. Towards the end, we started building even more platforms. As I said, we cheated on the software runtime as a system, because we just read the source code and understood what was going on. That's not a sustainable way of doing it. We very much intended to do program profiling, get that data, treat it the same way we were treating other data, and turn it into a dataset that anybody can look into. We were also thinking about having a performance testing platform: whatever change you want, you can push it through the system, and it would tell you immediately what the difference is against any of the other settings you care about. A lot of these things did not happen because we all got kicked out of Twitter.

How Do Performance Engineers Fit In?

That was the technical side of things. To fit in this track, you have to talk about things that are not just technical, like, how do performance engineers fit into the broader organization if this is the sort of thing we do? The short answer is we don't fit in. I think the best analogy may be from the "Finding Nemo" sequel, "Finding Dory": there's this octopus that escapes from the aquarium. I sometimes think of the performance engineering team as that octopus, because we are dealing with a system property here. Whatever we do can take us to any part of the system, and maybe to talking with any team in any part of the organization. Whenever executives or management try to box performance engineering into a particular part of the organization, which is very hierarchical, we somehow find something to grab onto and sling ourselves out of the box they made for us. If you are interested in performance engineering, or really any of the system properties like reliability or security, and you care about having a global impact, you have to put a lot of thought into managing the hierarchical org, which is, I think, fundamentally incompatible with the way these properties work. You have to take a first-principles approach, figure out the fundamentals, and then be very clever about it. I think the fundamentals are actually very straightforward. With performance engineering, you either line up with the top line, which is making more money, or you line up with the bottom line, which is spending less money. Sometimes you need to pick which one is more valuable. Sometimes you will switch between them, but you need to know which one you are dealing with. The value has to be associated with one of these. It will affect questions like: who is your customer, and who is your champion, because you need to know who you're helping. Also, the decision-making structure is really important. If you have a top-down structure of decision making, you find the highest touchpoint you can, and that person is going to give all the commands on your behalf. Then your life is easy. If you have a bottom-up organization, then you need to grow a fan base, essentially people who like using your thing, and they will be the advocates on your behalf. Constantly think about how to align incentives and how to convince others to do your bidding. One last thing: how to get promoted and recognized, because that's not obvious. If you're doing this type of work, you are not the same as the majority of the rest of the engineering org. Your story has to be tight.

I have some other thoughts. One thing early on that was a little controversial was that people said, if you have a performance engineering team, does that mean people won't do performance work elsewhere? Are you going to do all the performance work? I think the answer is absolutely no. In any engineering org, there will be lots of often very good engineers who are willing and capable of doing excellent performance engineering work, but they won't join your team for various reasons. I think the answer is always, you let them. Not only do you let them, you help them. The symbolic significance of having a performance engineering team cannot be overstated, because it means that the org values this type of work. Also, nobody ever shoots a messenger who delivers the news of victory. You can take out your bullhorn and give everybody credit, and you will still come out as a hero. Be a focal point where people can come to you in the right context and say, let's talk about performance; they contribute their ideas, they even do the work for you, and then you give them credit and make them look good. This is a great way of building a culture that values this kind of collaborative approach, and that values this type of systems thinking and global optimization in general. There's also a bit of a funny boom-and-bust cycle going on with performance engineering, which is the opposite of the general business cycle. When the business is good, everything is growing, and nobody cares about cost because you're just floating in money. Nobody cares about performance, and that's ok; nobody gets fired when the company is doing amazing. The reverse can also be true: the company is doing really poorly, and everybody is stressed out. At this point you can be the hero. You can say, I know how to make us grow again, I know how to save money. Now everybody wants to work with you. Take advantage of this reverse boom-and-bust cycle. Do your long-term investment when nobody is paying attention to you, but have all the plans ready for when the company is in need, when the business is in need, and be there ready to make an impact felt throughout the company.

How It Actually Happened

I presented the things we did and how we fit in, and it's all nice and neat, as if I had it all figured out. As everybody can probably guess, the birth of anything is never this neat. It's very messy. This is what actually happened. What actually happened is that back in 2017 I had no idea any of this would happen, or how I would go about it. All I had was an itch. I had just had my first baby. I was constantly sleep deprived and holding a baby in my arms like 16 hours every day. It was in this dazed state that I started thinking about performance. I kept drifting into these thoughts, thinking about performance in a very deep, philosophical way. If you don't believe this is philosophical, just replace performance with happiness and you'll see my point. I spent months with it. Then finally a vision occurred. I figured out the ideal performance engineer, who: knows hardware, is good at software, has operations experience, can do analytics, can speak business, and can deal with any kind of people you run into up and down the org. Might as well walk on water at that point. A unicorn, and not just any unicorn, a rainbow unicorn.

Coming back to reality, in engineering speak, you need a team. Not just any team, but a diverse team, because it's very hard for anybody to check even three of those boxes. You have all these fields to cover. Everybody needs to be good at something. It's very difficult to hire or optimize for six different attributes. I have a simple rule of thumb: I need to find people who are much better than me at at least some of these things. I just kept going at it, and it worked out ok. When we first started back in 2017, it was like four people. They were all internal. We were all SRE-adjacent. We all had this mindset that we need to think about system properties. We started doing Rezolus, which is the performance telemetry work. That was the one concrete idea we had; otherwise, we were just figuring things out. We did a lot of odd jobs, like talking to a different team and saying, what's bothering you? Is there anything else we can do? Someone says, can you install the GPU driver? We're like, ok, maybe. Then sometimes we ran into small opportunities. When I say small opportunities, I mean a couple million dollars here and there. When you have an operational budget of like $300 million a year, you will find million-dollar opportunities under the couch. We also did lots of favors, not necessarily things that have anything to do with performance engineering. At some point I was in charge of infrastructure GDPR work, I don't even know why. It's something that nobody wanted to do, and I had the bandwidth to do it. I did it, and everybody was happy that someone did it. Just proving that you're helpful and valuable. Survival mode, I would say. We made lots of mistakes. Certain ideas didn't hold up, and we moved on. We wrote those off, saying, don't do this again. Over time, we figured out what our vision is and what our methodology is, and we wrote it down. Once we had a little bit of trust, and people saw us as helpful, we got enough small wins to start ramping up. This is where a lot of the foundational work started to happen, with the tracing work and with the long-term metrics work, and all of these happened because we ended up hiring people who were extremely good at those things. They brought in their vision, and I didn't tell them what to do, I had no idea. They told me, and I'm like, "This makes great sense, please." Then we dogfooded: we used our own products like Rezolus to debug performance incidents. We were also looking forward, looking at all the crazy hardware things happening out there, and seeing if any of those would eventually pan out. Speculative investments. We went out and continued to talk to everybody. It was like, can we help you? Now that we have a little bit of tooling and a bit of insight, we can answer some of those questions. We did all of those things.

That paid off. At the beginning of the pandemic, there was a big performance and capacity crunch. At that point, people started noticing us, and they felt the need to come talk to us. That's when the consulting model started to flip: instead of us going out to reach people, people were coming to us. Then we tried to make them happy, and we tried to make our effort impactful. As our primary datasets matured, and we started building our own products on top of our platform, people started to use those products, and people started to understand how they could use the same data to do their own thing, which may or may not have to do with performance. Then we got invited to more important projects, mostly to look after the capacity and performance aspects of things. People want to tick a box saying, performance-wise, this is fine. We were often there for it. All of this made me think, this is really wonderful. People know us. We put a lot of effort into intentionally branding ourselves: we are the performance people, we are the systems people, we can help you answer performance-adjacent questions. We really put effort into talks and publishing and even writing papers, stuff like that. The team got quite big, and we got all kinds of talent that we didn't use to have.

Finally, this would have been a happy ending if we had actually gotten to execute it. We matured, so we figured out which functions are closely adjacent to performance, which is things like capacity and a lot of the fleet health stuff. The idea was not to grow the performance team linearly, but instead to start having a cluster of teams that work really well together. The team actually got a little bit smaller, because that's easier to work with. Nonetheless, in 2022 everybody was feeling the crunch of the economic downturn. That's what I'm talking about: when there's a drought, the performance team can be the rainmaker, because you're the only one sitting on top of a reserve of opportunities, and you can really make things better for the business. We went with it. All of our pent-up projects were greenlighted. For three months it was glorious, and then Elon Musk showed up.

Lessons Learned

What are the lessons? Especially for a team that deals with the entire organization, I think the technical and social considerations are equally important. Several other people have talked about this. In my opinion, these principles, or methodologies, for performance engineering have stood the test of time. One is, we need a pipeline of opportunities. Some of them may not pan out. Some of them may turn out to be not as useful as we wanted. But if you have a lot of opportunities, because you've surveyed the entire system and measured everything, then eventually you have a very reliable, consistent pipeline that keeps churning out opportunities you can go after. For performance teams, usually the best place to look is low-level infrastructure, which affects everybody, or things that take up like 25%, 30%, 40% of your entire infrastructure. You know what those are. They're the elephants in the room; they're very obvious. Finally, if we want to scale, then it's really about creating platforms and products that allow other people to do similar things. Those things may not be worth your dedicated team's time, but a local engineer can do them in their spare time, and it turns out to be a very decent win.

I think the other organizational lesson is: design the team to fit the organizational structure. Is it top-down? Is it bottom-up? Do it accordingly; don't copy what we did if your organization is different. Also, outreach is serious work, treat it as such. Adoption is serious work too, not just building a product. All of these things are work, in every sense of that word. Also, people make work happen. It's really true. Everybody's strengths and personality are different, and we have to respect that. I think there's a huge difference between people who are at the p99 of the talent pool and at the p50. When you find such a person, really cherish them, because they're very hard to come by. Seek diversity, both in skills and in perspective, because when we do this kind of diverse work, that is really what's needed to deliver. Embrace chance. What I mean by that is, had we found a profiling expert, we probably would have built the profiling platform first, and then maybe the metrics platform later. Because the talent was interested in those things, we let them lead us down the path of least resistance. It's ok; we can always come back later and make up for the things we didn't build in the first place. Don't have a predetermined mindset saying, this has to happen first. As long as everybody is going in the right direction, the particular path is not that important. Finally, it's the thing my kids' teachers tell them all the time: be kind to each other. Software engineering is really a social enterprise. To succeed in the long term, we need to be helpful, be generous, make friends, and make it into a culture; then everything will be so much more enjoyable for everybody.

IOP Systems after Twitter

What happened to IOP after Twitter? A bunch of my co-workers found really excellent jobs at all kinds of companies. A number of us banded together and started our own thing. What we're doing is not so different from what we used to do at Twitter, but now our goal is different. We want to do something for the industry; we want to build something that everybody can use.

Questions and Answers

Participant 1: [inaudible 00:44:36]. What was one of the things that was most surprising on that journey that [inaudible 00:44:47]?

Yue: I think what was a little unexpected might be just how valuable understanding the structure of the service and the structure of the distributed system is. Like the index: this service talks to that service; just the fact that a connection of any kind exists between those two, or that they are two steps away. That information. It's not obvious that this is useful for performance; only after several other steps do we get value out of it. I think if I had to go blindly into a new problem, or go seeking a new property that is desirable, I would just say, let's understand the system, regardless of the goal. Now this has become actionable to me. At the time, it was somewhat accidental.

Participant 2: You mentioned that you guys are using eBPF for [inaudible 00:45:59]. What kind of system metrics are you looking for? [inaudible 00:46:06]

Yue: eBPF is like this little bit of a C program that doesn't have loops and has very limited functionality, and you can execute it in the kernel. It's like god mode, but with a limited domain. The metrics we get from eBPF, we use to gather things we couldn't get otherwise. If you want to get the I/O size distribution, like, I'm going to the disk or I'm going to the SSD, am I getting 4 bytes, or am I getting 16 kilobytes? That distribution is quite important, because it tells you the relationship between the number of I/Os and the bandwidth they consume. One thing we realized later on is that even for the things you can get traditionally, like the metrics you have in procfs, CPU metrics, sometimes memory metrics, all of these can be obtained with eBPF as well, actually, for far less cost. There's a reason there are a lot of recent papers at the USENIX ATC conference, or USENIX's OSDI conference, where most of what they did is do a thing using eBPF in kernel space, and suddenly everything is quantitatively better. We are in this transition of doing as many of the things we want as possible using eBPF. eBPF programs are written such that you have these basic data structures in kernel space, and you do your count increments there. If you want to report anything, you usually have a companion user-space program to pull the data out and then send it to your stats agent. We do the standard thing where there's this user-space and kernel-space duality. That's how we do it.
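
That kernel-space/user-space duality maps fairly directly onto bcc's Python front end; below is a hedged sketch close to bcc's bitesize example, not the team's actual sampler. It assumes root privileges and an installed bcc, and tracepoint names and fields can vary across kernel versions.

```python
#!/usr/bin/env python3
"""Sketch of the eBPF duality described above, using the bcc Python front
end. The kernel half increments a log2 histogram of block I/O sizes; the
user-space half pulls the in-kernel map and prints it. Illustrative only."""

import time
from bcc import BPF

prog = r"""
BPF_HISTOGRAM(dist);

// Count every block-layer request by its size, in power-of-two buckets.
TRACEPOINT_PROBE(block, block_rq_issue) {
    dist.increment(bpf_log2l(args->bytes));
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing block I/O sizes for 10 seconds...")
time.sleep(10)

# User-space companion: read the in-kernel histogram map and report it.
b["dist"].print_log2_hist("I/O size (bytes)")
```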

Participant 3: You mentioned that [inaudible 00:48:11], dependency graph. How will you eventually store such data, and where? How do you store traces?

Yue: We have a real-time data pipeline doing these things. There's the hidden assumption that things are partitioned by time: traces from today are going to sit closer together than a mix of traces from yesterday and today. Beyond that, I think there's no obvious standard for how you would organize them that would answer every type of query, so we basically didn't bother. We have five different indices, depending on what level of information you care about. Once you get to the right index, generally what you do is scan most of it to filter down to the entries you want. The queries are not designed to be fast. I think the value is that they can answer questions no other data source can answer, so people put up with the 5-minute delay to execute those queries.
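
For flavor, here is a hedged sketch of the "pick the right index, prune by time partition, then scan and filter" pattern. The table and column names are hypothetical, not Twitter's schema, and the SQL assumes a Presto-style engine.

```python
# Hypothetical sketch (invented names): query a time-partitioned edge index.
# Partition pruning on ds limits the scan to one day of traces; the rest is
# a scan-and-filter, which is why such queries take minutes, not milliseconds.

EDGE_LOOKUP = """
SELECT caller,
       callee,
       count(*) AS calls,
       approx_percentile(duration_ms, 0.99) AS p99_ms
FROM trace_edge_index                 -- hypothetical index table
WHERE ds = '2022-10-01'               -- time partition: today's traces only
  AND callee = 'tweet-service'
GROUP BY caller, callee
"""

# rows = sql_client.execute(EDGE_LOOKUP)   # sql_client is an assumption
```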

 
