When I am reborn, I just want to be a top student

Chapter 977 Improved Employee Benefits and a New One-Year Development Plan

Wang Donglai watched Christopher and his group disappear down the corridor from the reception room. Liang Song put the signed letter of intent into a folder and glanced up at him.

Wang Donglai shook his head, sat back down in the main seat, picked up the cup of Longjing tea that had long since gone cold, and took a sip.

The ASMI matter was over for him; the rest was the work of the legal and financial teams, and he didn't need to worry about it anymore.

He put down his teacup, stood up, and said to Liang Song, "Have the instruction set development team prepare. I'll go to the lab this afternoon to check on the progress."

Galaxy Semiconductor was headquartered at the easternmost end of the Tangdu High-tech Zone, separated from the lithography factory by only a wide ginkgo avenue.

The instruction set development team's laboratory was on the eleventh floor of the building. The entire floor was designated a confidential area, and access required dual authentication.

When Wang Donglai swiped his work badge and passed through the airlock, dozens of people in the laboratory were arguing heatedly in front of a huge whiteboard.

The whiteboard was covered with densely packed architecture diagrams, with different colored markers marking data paths, control units, and cache levels layer upon layer, like a battle map that had been repeatedly scribbled on.

"Mr. Wang is here."

Someone shouted first, and the argument abruptly stopped.

Everyone turned to look at him, their eyes filled with awe, expectation, and barely suppressed anxiety. They had been stuck on this one question for almost two weeks.

"Continue."

Wang Donglai pulled up a chair and sat down in the corner of the laboratory, indicating that they didn't need to worry about him.

But no one continued arguing; all eyes were on the gray-haired old engineer next to the whiteboard.

His name was Chen Yuanzhou. He shared a name with the Chen Yuanzhou in charge of the HarmonyOS ecosystem, and he had been immersed in the semiconductor field for just as long.

After retiring from the National Institute of Large Scale Integrated Circuits, he was personally invited by Wang Donglai to work at Galaxy Semiconductor, where he was specifically responsible for the research and development of domestic instruction sets.

Chen Yuanzhou put down his marker, turned to Wang Donglai, and said directly, "Mr. Wang, you've come at the right time. We're stuck on the branch prediction module of the instruction pipeline. The prediction accuracy has plateaued at around 90%, and we can't push it any higher. That figure was top-tier in the industry three years ago, but HarmonyOS now places extremely high real-time demands on chips. An in-vehicle system has to process LiDAR point clouds and autonomous driving decisions simultaneously, while a mobile phone has to complete the voice assistant's natural language understanding within tens of milliseconds. Every time the pipeline mispredicts, it has to be flushed and refilled, wasting several clock cycles. Those losses add up, and the distributed collaborative experience of HarmonyOS degrades badly on low-end chips."

Wang Donglai stood up, walked to the whiteboard, and carefully examined the architecture diagrams for several minutes.

Then he picked up a black marker and drew a minimalist diagram next to the branch prediction module: a tiny auxiliary prediction unit, separate from the main pipeline.

"Your current approach is to pile the branch prediction logic at the front end of the main pipeline and use a deep learning model for pattern recognition. The idea is correct, but the placement is wrong. The main pipeline is too crowded, and the signal has to wind its way around, taking several clock cycles to complete. Split off this auxiliary prediction unit and attach it directly after the instruction fetch stage. It takes no part in any of the main pipeline's calculations and does only one thing: it calculates the branch target address one cycle in advance and feeds it to the instruction fetch unit."

He drew a thick arrow between the auxiliary prediction unit and the instruction fetch unit, and labeled several key delay parameters next to the arrow.

Chen Yuanzhou stared at the arrow for a long time, then suddenly took off his reading glasses and wiped the lenses with his sleeve.

He repeated this action twice, then put his glasses back on, picked up a red marker, and quickly wrote a set of mathematical formulas next to the auxiliary prediction unit.

The scratching sound of the pen tip across the whiteboard grew faster and faster. When he wrote the last equal sign, his hand trembled slightly, not from nervousness, but from excitement.

"By separating the prediction logic and running it independently, the main pipeline doesn't need to wait, and instruction fetch doesn't have to contend for resources. Mr. Wang, your approach isn't optimization; it's a direct architecture overhaul. But how do we solve the data synchronization problem between the auxiliary unit and the main pipeline? No matter how fast the prediction unit runs, if it's out of sync with the state of the main pipeline, the prediction results are useless."

Wang Donglai added a dashed line between the auxiliary unit and the main pipeline, and labeled the timing diagram of synchronous latching next to it.

"Use an asynchronous FIFO buffer. The depth doesn't need to be large; enough to store two prediction results is sufficient. The prediction unit calculates in advance and puts the result in, while the instruction fetch unit retrieves it automatically at the designated time. The main pipeline never waits for the prediction, and the prediction never drags down the main pipeline. The two sit in separate clock domains, so each runs at its own pace. Nuwa used a similar asynchronous buffer approach when designing the HarmonyOS kernel scheduler. The depth parameter of the FIFO can be directly adjusted."

Chen Yuanzhou placed the red marker in his hand in the whiteboard tray, took a few steps back, looked at the densely drawn architectural diagram, and remained silent for a long while.

Then he turned to a young man wearing glasses in the team and said, "Xiao Liu, build a Verilog prototype of the auxiliary prediction unit that Mr. Wang just drew. Set the clock constraints according to the asynchronous FIFO scheme. Run the simulation as soon as you finish building it today. Mr. Wang, if this version runs smoothly, the prediction accuracy should improve by several percentage points."

He added with certainty, "It's not a linear improvement, it's a direct elimination of prediction latency. If this branch prediction logic works, HarmonyOS's real-time performance on low-end chips will at least catch up with the current level of mid-to-high-end chips."

Wang Donglai nodded without saying anything more.

He stayed in the lab for a while longer, reviewing the optimization schemes for the cache coherency protocol and the clock gating design in low-power mode, and made several adjustment suggestions before leaving the instruction set lab.

The corridor was dim, with only the emergency lights on.

He stepped into the elevator and pressed the button for another floor, the floor where the AI chip R&D team was located.

The atmosphere in the AI chip lab was even more somber than in the instruction set lab.

Several test boards were laid out on the long table, each with a different version of the AI acceleration core soldered onto it.

Next to the test boards lay a thick stack of power consumption curve reports, the bottom edge of every page worn and frayed.

The project leader, surnamed Zhou, was a senior architect poached from NVIDIA. His hair was mostly white, but his eyes were extremely sharp.

"Mr. Wang."

Engineer Zhou led him to the main test bench, where a power consumption curve of an AI inference task was running on the screen.

"Our current AI acceleration core, based on the traditional SIMD architecture, has caught up with NVIDIA's equivalent products in image recognition and natural language processing, but its power consumption remains high. The main cause is excessively frequent data transfer. Each layer of the neural network has to repeatedly load weights from external DRAM, and each load consumes more energy than a single calculation. If this problem isn't solved, our AI chip can only be used on the server side and cannot be integrated into in-vehicle systems or mobile phones."

He broke down the power consumption curve layer by layer, marking the corresponding data transfer volume on the screen for each layer.

From the convolutional layers to the fully connected layers and then to the attention mechanisms, the peak transfer volume climbed step by step, and the whole chart looked like a leaning wall.

Wang Donglai did not answer directly.

He walked to the whiteboard, picked up a marker, and drew a completely new architectural sketch.

It was not a traditional SIMD array, but a hybrid-granularity tensor computation unit, with coarse-grained processing for large-scale matrix multiplication and fine-grained processing for attention computation after sparsification.

Both shared the same set of on-chip caches, but their scheduling logic was separate.

"Traditional GPUs use SIMD to stack computing power and crush neural networks with brute-force calculation. But the bottleneck for AI inference tasks is not computing power; it is data movement. The weights of each layer of the neural network have to be moved from external memory into the computing unit, and moving them once consumes more energy than computing with them once. Your solution uses a large-capacity on-chip cache to reduce the number of times the data is moved, which is the right direction, but the capacity of the on-chip cache is ultimately limited. No matter how large the cache is, it cannot hold the weights of an entire GPT model."

He added a few strokes to the architecture diagram of the hybrid-granularity tensor computation unit, placing a small data compression engine between the on-chip cache and external DRAM.

"When data enters or leaves the on-chip cache, we add a layer of hardware compression and decompression logic. It's not software compression; it's a dedicated compression engine built directly into the silicon. The weights of the neural network themselves have a lot of redundancy. After sparsification, most of the weights are zero, and the non-zero parts also show strong regularity. Lightweight differential coding compresses the weight stream to a fraction of its original size, and the amount of data moved drops accordingly."

Engineer Zhou stared at the architecture diagram for a long time, his eyes growing brighter and brighter.

He picked up a red marker and wrote a few lines next to the data compression engine: differential coding, zero-value compression, adaptive quantization. Each line represented a cutting-edge direction in the field of hardware compression, but very few companies had actually put any of them into silicon.

After watching for a while, Engineer Zhou asked a crucial question: Compression and decompression themselves incur latency overhead. If the accumulated latency exceeds the idle window of the computing unit, the overall inference time will actually be prolonged.

Wang Donglai's answer was even more decisive: the compression/decompression logic and the computation pipeline run in parallel; instead of decompressing first and then computing, decompression and computation proceed simultaneously. Once a data block is decompressed, it is immediately pushed into the computation pipeline, without waiting for the entire batch to finish decompressing. The latency overhead is absorbed by the throughput of the computation pipeline, so net power consumption falls and net latency does not rise.

He wrote the last line on the whiteboard, then turned around and put the marker back in the slot.

Engineer Zhou stared at that line of text in silence for a long time.

The power consumption curves on the test bench were still running, slowly fluctuating on the screen. The computational power consumption of each layer of the neural network was broken down into two parts: data transfer and matrix operations, with the former accounting for an astonishingly high proportion.

He knew that if the architecture proposed by Wang Donglai could be successfully implemented, China's AI chips would go from nothing to something in both in-vehicle and mobile devices.

He turned to his team and said, "Turn Mr. Wang's hybrid-granularity computing unit scheme into an RTL-level simulation and get the results within three days. Also, inform Mr. Liang that the AI chip tape-out schedule is being moved up, and the lithography factory needs to schedule a separate production line."

The sound of keyboards clicking filled the laboratory.

Several young engineers gathered around the test bench, discussing the encoding scheme of the hardware compression engine. Someone pulled up a report on the sparsity analysis of neural network weights that Wa had done before, and marked the compression ratio of differential encoding layer by layer.

The young engineer who had jumped ship from NVIDIA with Engineer Zhou stared at the screen for a long time before suddenly saying, "If we really make this, NVIDIA's GPUs won't be able to compete with us in edge inference anymore. It's not a price war; it's overtaking them at the architecture level. They're still using SIMD to stack computing power, while we've already switched to dataflow-driven tensor computing."

Engineer Zhou did not answer.

He just stared at the densely drawn architecture diagram on the whiteboard and recalled how, during his time at NVIDIA, he had repeatedly advocated using near-memory computing to develop edge AI chips, only for the idea to be pushed aside each time by the higher-priority data center GPU projects.

Now he sat in a laboratory in Tangdu, working on it again with a group of young people. At last, someone was willing to take this path seriously.

The next day, Wang Donglai sat in his office reviewing the initial RTL simulation data that the AI chip team had worked on overnight.

The power consumption curve sloped smoothly downward on the screen, and the scheduling latency of the hybrid-granularity computing unit was better than the design specifications.

He picked up a stylus and wrote two lines on the report: the tape-out milestone should be brought forward, the lithography factory should schedule a separate production line, and priority should be given to ensuring the delivery of the first batch of engineering samples of AI chips. Then he pushed the report to Wa for archiving.

"Wa, retrieve the current total number of employees and business distribution of Galaxy Group."

A set of data suddenly popped up on the screen.

The total number of employees had just exceeded 1.01 million, distributed across core business lines such as Xinghuo Express, Pinhaofan, Galaxy Supermarket, Galaxy Energy, Galaxy Semiconductor, Galaxy Aerospace, and Galaxy Biotechnology, as well as talent apartments, community canteens, production line training centers, and newly completed childcare centers for employees' children across the country.

With over a million employees, it was already the first private enterprise in China to reach this scale.

Even compared with state-owned enterprises, the gap was small.

Furthermore, it was obvious to anyone with eyes to see that, given Galaxy Technology's growth momentum, its workforce would definitely continue to increase.

Galaxy Supermarket, Galaxy Agriculture, Galaxy Home Appliances, and other businesses were springing up everywhere.

"Draft a salary and benefits adjustment plan. It requires a general increase in base salary for all employees, with a minimum increase of 10%. Meal allowances, housing allowances, and transportation allowances will be optimized at the same time. In addition, long-term incentives for core positions will be doubled. The specific plan should be discussed with the heads of each department and summarized for me within one day. I need it tomorrow."

"The construction progress of the talent apartments and employee children's care center should be reported separately. A special fund should be allocated from the president's reserve fund, bypassing the regular budget approval process. You can draft the specific adjustment plan and send it directly to the heads of relevant departments for confirmation."

"Finally, we will prepare an annual budget. Next year, we will expand the scale of Galaxy Education and establish a complete chain from kindergarten to primary school to middle and high school. Initially, we will mainly rely on group employees. We will focus on creating high-quality products. We can lose money in the early stages, and later control the profit margin to around 8%."

"We will increase the planting scale of Galaxy Agriculture and support the new varieties developed by Galaxy Biotechnology so they can be planted to a high standard of quality."

Wang Donglai spoke very quickly, and Wa didn't miss a single detail, immediately using supercomputing power to make work arrangements.
