22 October 2022
How Azure works: what the architecture and approaches behind cloud computing might be
Continuing my thoughts on the internal structure of clouds, today I would like to reflect on how they actually work inside, from an IT engineering point of view. Let's try to imagine what layers and blocks Microsoft Azure might consist of.

We all know that clouds are divided into IaaS, PaaS, and SaaS.

(taken from here)
At the bottom we have the hardware; above it sits a virtualization layer, and then the operating system, runtime environment, and so on. Everything from the operating system level upward is fairly clear, but what happens below is often shrouded in mystery. Servers, storage, and networks are nothing new in themselves, but how they all come together to become virtual machines is the big question.

Let's discuss how it might work. I'm not a great specialist in virtualization, so everything below is purely my own assumption.

So. Let's say we have a large server. It can consist of several "compute units" that contain many processors and a lot of memory. Disk drives, for example, are installed in a separate "storage unit" and shared somehow between the compute units. There should also be "network units" for networking, and of course a unit with GPUs. All of this must be powered, but let's assume the power supplies are built into the functional blocks. This seems to be the minimum set required for the server to work.

In fact, this division into units is purely logical. Physically, it can all be a single board in a single case (similar to how a desktop or laptop is built) or many boards combining the logical elements in one way or another. Usually all of this is placed in a server rack, and there can be many such racks. So it might look like this:


taken from here
or like this
taken from here
Can you feel its beauty? :)

So the scheme of the hardware part should look like this:
As a result, we have a very powerful computer on which we need to install an operating system that lets us utilize all of this hardware. The exact type of operating system is not that important here, because essentially it acts as a layer between our hardware resources and the special software that makes virtual machines possible. Personally, though, I would install Linux :)

Let's call this software a virtualization system. There are quite a few such systems, but the most well-known are probably VMware, Hyper-V, and KVM. If you're interested, you can learn more about them (there is a whole world of fascinating technology behind how virtualization interacts with hardware), but I will focus on the fact that these systems can do the most important thing for us: create virtual machines using the hardware resources of the server they run on. And not just create them, but do so in a way that the virtual machines do not know about each other and cannot "intrude" into each other's resources, even though they actually execute on the same CPUs and share the same RAM. This is the main pillar on which all clouds rest. In reality, of course, everything is more complicated, and much more runs in the server's operating system, but we will focus on the essentials.
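To make the isolation idea concrete, here is a toy model (not how Hyper-V or KVM actually work — all names and numbers are invented): a hypervisor hands out non-overlapping slices of a host's CPUs and RAM, and refuses to create a VM that would encroach on resources already promised to another VM.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    # Total hardware resources of the physical host.
    cpus: int
    ram_gb: int
    # vm_name -> (cpus, ram_gb) already handed out
    allocations: dict = field(default_factory=dict)

class Hypervisor:
    """Toy virtualization system: carves a server's CPUs and RAM
    into isolated, non-overlapping virtual machines."""

    def __init__(self, server: Server):
        self.server = server

    def free_resources(self):
        used_cpu = sum(c for c, _ in self.server.allocations.values())
        used_ram = sum(r for _, r in self.server.allocations.values())
        return self.server.cpus - used_cpu, self.server.ram_gb - used_ram

    def create_vm(self, name, cpus, ram_gb):
        free_cpu, free_ram = self.free_resources()
        # Isolation guarantee: a VM can never claim resources
        # already promised to another VM on the same host.
        if cpus > free_cpu or ram_gb > free_ram:
            raise RuntimeError("not enough free resources on this host")
        self.server.allocations[name] = (cpus, ram_gb)
        return name

    def destroy_vm(self, name):
        self.server.allocations.pop(name)
```

Real hypervisors also time-slice CPUs and overcommit memory, but the bookkeeping principle — the host tracks every slice it has promised — is the same.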

In Azure, with its scale, a combination of virtualization systems is most likely used. It would be strange if Microsoft itself did not use Hyper-V, but it would also be strange to ignore the benefits of KVM :)

So, virtualization systems are capable of creating, modifying, and destroying virtual machines. But who will tell them when and which virtual machine to create, and on which server? This is where the main components of the cloud come into play: the compute, network, and storage orchestrators. Why these three? Because they are the three most important components of the IaaS layer. (Next to them, even GPUs look like a minor option :)

  • The Compute orchestrator manages virtual machines on which literally everything runs in the cloud.
  • The Network orchestrator manages routers that provide physical network connectivity between servers, racks, data centers, etc. This orchestrator also manages load balancing and traffic routing issues.
  • The Storage orchestrator manages storage arrays, allocating resources for virtual machines and higher-level services (such as Azure Storage).
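The core job of the compute orchestrator — deciding *which* server gets a new VM — can be sketched as a placement function. This is my own naive illustration, not Azure's actual scheduler; host records and field names are invented:

```python
def place_vm(hosts, cpus, ram_gb):
    """Naive 'spread' placement policy: among hosts that can fit
    the requested VM, pick the one with the most free CPUs (ties
    broken by free RAM). Returns a host name, or None if no host
    in the fleet has enough capacity."""
    candidates = [h for h in hosts
                  if h["free_cpus"] >= cpus and h["free_ram_gb"] >= ram_gb]
    if not candidates:
        return None
    best = max(candidates, key=lambda h: (h["free_cpus"], h["free_ram_gb"]))
    return best["name"]
```

Real schedulers weigh many more factors (fault domains, licensing, GPU availability, planned maintenance), but at the bottom there is always some version of this "find a host that fits" loop.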

An orchestrator is already something like a "cloud" service inside the cloud. It must have an internal endpoint, some authentication system, and a REST API so that it can be asked to execute a particular command. But where will the orchestrator store the current configuration of the resources? It could be kept in some blob inside a storage array, but I would choose a very reliable database that can scale at the level of the entire cloud. Let's assume it is an internal installation of Cosmos DB.
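A possible shape for such an endpoint, sketched under the article's assumptions (the handler name, request fields, and the dict standing in for the metadata database are all invented): the orchestrator validates a request, persists the *desired* state, and returns a resource id right away, leaving the actual work to background machinery.

```python
import json
import uuid

# Stand-in for the orchestrator's metadata database
# (an internal Cosmos DB installation, per the assumption above).
CONFIG_DB = {}

def handle_create_vm(request_body: str) -> str:
    """Hypothetical REST handler of the compute orchestrator.
    It records the desired state and returns immediately; an async
    worker would later pick the record up, call a hypervisor on a
    chosen host, and flip the state to "running"."""
    spec = json.loads(request_body)
    for required in ("name", "cpus", "ram_gb"):
        if required not in spec:
            raise ValueError(f"missing field: {required}")
    vm_id = str(uuid.uuid4())
    CONFIG_DB[vm_id] = {"spec": spec, "state": "creating"}
    return vm_id
```

This "record desired state first, reconcile later" pattern is why cloud APIs return a resource in a `Creating`/`Provisioning` state instead of blocking until the VM boots.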

But now we have another question: where does all of this run? This is where the term dogfooding comes in: inside your system, you should use components of the system itself. In other words, if we let our users work with virtual machines, why not use the same virtual machines for our own needs? So let's assume that all internal services (the orchestrators and the metadata storage) run on regular virtual machines created by the very virtualization system they are designed to manage.

Thus, our scheme is now slightly more complicated:

It looks like we can already work with this. We have services we can ask to create virtual machines, set up a network, and allocate some space for storing data. But how do we determine who has the right to issue commands and who does not? That's right: we need a service responsible for Identity and Access Management. In Azure, a whole combination of entities works here (Azure Active Directory, RBAC, various kinds of identities, etc.), but for simplicity we will combine it all into one. We will also add a service for collecting monitoring information (otherwise, how would we understand what is happening inside our cloud and evaluate who spent how many resources?) and a service that calculates how much money our customers have to pay us.
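The monitoring-plus-billing pair can be reduced to a tiny sketch (prices, SKU names, and event shapes are all invented for illustration): the monitoring side records usage events, and the billing side folds them into a per-customer sum.

```python
# Toy metering + billing. The monitoring service appends usage
# events; the billing service turns them into a Pay-As-You-Go bill.
PRICE_PER_HOUR = {"vm.small": 0.05, "vm.large": 0.40}  # invented prices

usage_log = []

def record_usage(customer, sku, hours):
    """Monitoring side: append one metered usage event."""
    usage_log.append({"customer": customer, "sku": sku, "hours": hours})

def bill(customer):
    """Billing side: sum price * hours over the customer's events."""
    return sum(PRICE_PER_HOUR[e["sku"]] * e["hours"]
               for e in usage_log
               if e["customer"] == customer)
```

The key design point is that billing is derived entirely from the metering stream: if an event was never recorded, it can never be charged, which is why accurate metering sits at the heart of every cloud.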

In addition, I would personally add another service that centrally manages access to the REST APIs of both the internal and external services of our cloud, in accordance with the rules established by our Identity and Access Management service.
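Such an API front door could look roughly like this (a minimal sketch; the token table, role names, and status codes are assumptions, not any real Azure mechanism): every request carries a token, and the gateway consults the IAM rules before dispatching the call to an orchestrator.

```python
# Toy API gateway backed by a toy IAM table.
# In a real cloud, tokens would be validated cryptographically and
# roles resolved by the Identity and Access Management service.
TOKENS = {
    "token-alice": {"user": "alice", "roles": {"vm.create", "vm.delete"}},
}

def gateway(token, action):
    """Return an HTTP-style (status, message) pair for a request."""
    identity = TOKENS.get(token)
    if identity is None:
        return 401, "unauthenticated"      # unknown token
    if action not in identity["roles"]:
        return 403, "forbidden"            # known user, missing role
    return 200, f"dispatched '{action}' for {identity['user']}"
```

Centralizing this check means the orchestrators themselves never see an unauthenticated request, and access rules live in one place instead of being duplicated in every service.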

So now we have an even more complex structure.

I may have missed something, but in general this seems to be enough for us to build many "external" services for users on top (for example, PaaS databases, various types of storage, serverless, IoT, etc.).

Let's see if we meet all the public cloud criteria from the NIST definition:

On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.
Yes. Thanks to the API service, any user can centrally and independently manage their resources.

Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).
Yes, our Network orchestrator provides cross-platform connectivity to all of the user's resources.

Resource pooling. The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.
Yes, our Compute and Storage orchestrators ensure that a single hardware layer is shared by all users.

Rapid elasticity. Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear unlimited and can be appropriated in any quantity at any time.
Yes, thanks to our orchestrators and virtualization system, we can change the configuration on the fly.

Measured service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
Yes, our logging, monitoring, and billing systems allow us to implement the Pay-As-You-Go model for our users.

So, that's it. We have built a real cloud in our minds. Of course, it is much more complicated inside, and we didn't even touch on how PaaS services work, for example, but I hope the main approaches used in clouds are now clearer to you.

That's all for now. In the next article, we will dig a little deeper into how PaaS services work.