AWS re:Post

How do I troubleshoot a Windows WorkSpace that's marked as unhealthy?

My Amazon WorkSpaces Windows WorkSpace is marked as unhealthy. How can I fix this?

Short description

The WorkSpaces service periodically checks the health of a WorkSpace by sending the WorkSpace a status request. A WorkSpace is marked as unhealthy if the WorkSpaces service doesn't receive a response in a timely manner.

Common causes for this issue are:

  • An application on the WorkSpace is blocking the network connection between the WorkSpaces service and the WorkSpace.
  • High CPU usage on the WorkSpace.
  • The agent or service that responds to the WorkSpaces service isn't running.
  • The computer name of the WorkSpace changed.
  • The WorkSpaces service is in a stopped state or is blocked by antivirus software.

Try the following troubleshooting steps to return the WorkSpace to a healthy state:

Reboot the WorkSpace

First, reboot the WorkSpace from the WorkSpaces console.

If rebooting the WorkSpace doesn't resolve the issue, connect to the WorkSpace by using a Remote Desktop Protocol (RDP) client.

If the WorkSpace is unreachable by using the RDP client, follow these steps:

  • Restore the WorkSpace to roll back to the last known good snapshot.
  • If the WorkSpace is still unhealthy, rebuild the WorkSpace.
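
The reboot, restore, and rebuild actions are also available through the WorkSpaces API. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and a client created with `boto3.client("workspaces")`. The helper names are hypothetical, and the client is passed in as a parameter so the helpers can be exercised without AWS credentials.

```python
# Hypothetical helpers around the boto3 WorkSpaces client.
# `client` is assumed to be boto3.client("workspaces").

def is_unhealthy(client, workspace_id):
    """Return True if the WorkSpace currently reports the UNHEALTHY state."""
    resp = client.describe_workspaces(WorkspaceIds=[workspace_id])
    return resp["Workspaces"][0]["State"] == "UNHEALTHY"


def reboot(client, workspace_id):
    """Request a reboot of one WorkSpace."""
    return client.reboot_workspaces(
        RebootWorkspaceRequests=[{"WorkspaceId": workspace_id}]
    )


def restore(client, workspace_id):
    """Request a restore of one WorkSpace to its last known good snapshot."""
    return client.restore_workspace(WorkspaceId=workspace_id)


def rebuild(client, workspace_id):
    """Request a rebuild of one WorkSpace."""
    return client.rebuild_workspaces(
        RebuildWorkspaceRequests=[{"WorkspaceId": workspace_id}]
    )
```

These calls return before the operation finishes, so check the state again with `is_unhealthy` before moving on to the next step.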

If you can connect to your WorkSpace using RDP, then verify the following to fix the issue:

Verify CPU usage

Open the Windows Task Manager to determine if the WorkSpace is experiencing high CPU usage. If it is, try any of the following troubleshooting steps to resolve the issue:

  • Stop any service that's consuming high CPU.
  • Resize the WorkSpace to a compute type that's greater than what is currently used.
  • Reboot the WorkSpace.

Note: To diagnose high CPU usage, see How do I diagnose high CPU utilization on my EC2 Windows instance when my CPU is not being throttled?

Verify the WorkSpace's computer name

If you changed the computer name of the WorkSpace, change it back to the original name:

  • Open the WorkSpaces console, and then expand the unhealthy WorkSpace to show details.
  • Copy the Computer Name.
  • Connect to the WorkSpace using RDP.
  • Open a command prompt, and then enter hostname to view the current computer name. If the name matches the Computer Name that you copied from the console, skip to the next troubleshooting section. If the names don't match, enter sysdm.cpl to open system properties. Then, follow the remaining steps in this section.
  • Choose Change, and then paste the Computer Name that you copied from the console.
  • If prompted, enter your domain user credentials.
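
The name comparison in the steps above can be sketched as follows. This is a minimal example, assuming it runs on the WorkSpace itself. The function name is hypothetical, and stripping a domain suffix before comparing is an assumption, not something the console requires.

```python
import socket


def names_match(expected, actual=None):
    """Compare two computer names, ignoring case, whitespace, and any domain suffix.

    `expected` is the Computer Name copied from the WorkSpaces console.
    `actual` defaults to the name the local machine reports.
    """
    if actual is None:
        actual = socket.gethostname()

    def short(name):
        # Windows computer names are not case-sensitive.
        return name.strip().split(".")[0].upper()

    return short(expected) == short(actual)
```

If `names_match` returns False, rename the computer as described in the remaining steps.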

Confirm that the WorkSpaces services are running and responsive

If WorkSpaces services are stopped or aren't running, then the WorkSpace is unhealthy. Follow these steps:

  • From Services, verify that the WorkSpaces services named SkyLightWorkspacesConfigService, WSP Agent (for the WorkSpaces Streaming Protocol [WSP] WorkSpaces), and PCoIP Standard Agent for Windows are running. Be sure that the start type for each service is set to Automatic. If any of the three services isn't running, start the service.
  • Verify that any endpoint protection software, such as antivirus or anti-malware software, explicitly allows the WorkSpaces service components.
  • If the status of the three services is Running, then the services might be blocked by antivirus software. To fix this, set up an allow list for the locations where the service components are installed. For more information, see Required configuration and service components for WorkSpaces.
  • If WorkSpaces Web Access is turned on for the WorkSpace, verify that the STXHD Hosted Application Service is running. Make sure that the start type is set to Automatic.
  • Verify that your management adapter isn't blocked by any application or VPN. Then, ensure that proper connectivity is in place.

Note: If WorkSpaces Web Access is turned on but not in use, update the WorkSpaces directory details to turn off WorkSpaces Web Access. The STXHD Agent can cause an unhealthy WorkSpace.
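
The service check can also be scripted. The following is a minimal sketch, assuming Python is available on the WorkSpace and that the Windows `sc query` command prints English-language output. The function names are hypothetical. SkyLightWorkspacesConfigService is the service name given above; the other entries are shown above by display name, so confirm their service names in Services before querying them.

```python
import re
import subprocess


def parse_state(sc_output):
    """Extract the state word (for example RUNNING or STOPPED) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else None


def service_state(service_name):
    """Run `sc query <service_name>` on Windows and return the state, or None if not found."""
    result = subprocess.run(
        ["sc", "query", service_name], capture_output=True, text=True
    )
    return parse_state(result.stdout)
```

For example, `service_state("SkyLightWorkspacesConfigService")` is expected to return `"RUNNING"` on a healthy WorkSpace.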

Verify firewall rules

Important: The firewall must allow listed traffic on the management network interface.

Confirm that Windows Firewall and any third-party firewall that's running have rules to allow the following ports:

  • Inbound TCP on port 4172: Establish the streaming connection.
  • Inbound UDP on port 4172: Stream user input.
  • Inbound TCP on port 8200: Manage and configure the WorkSpace.
  • Inbound TCP on ports 8201–8250: Establish the streaming connection and stream user input on WSP.
  • Outbound UDP on ports 50002 and 55002: Video streaming.

If your firewall uses stateless filtering, then open ephemeral ports 49152–65535 to allow for return communication.

If your firewall uses stateful filtering, then ephemeral port 55002 is already open.
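
The port list above can be captured as data and compared against the rules that a firewall actually allows. This is a minimal sketch: the tuple layout and the function names are assumptions for illustration, and exporting the allowed rules from Windows Firewall or a third-party product is left out.

```python
# Required rules from the list above: (direction, protocol, first_port, last_port).
REQUIRED_RULES = [
    ("inbound", "tcp", 4172, 4172),
    ("inbound", "udp", 4172, 4172),
    ("inbound", "tcp", 8200, 8200),
    ("inbound", "tcp", 8201, 8250),
    ("outbound", "udp", 50002, 50002),
    ("outbound", "udp", 55002, 55002),
]


def is_covered(required, allowed_rules):
    """Return True if a single allowed rule covers the whole required port range."""
    direction, protocol, first, last = required
    return any(
        a_dir == direction
        and a_proto == protocol
        and a_first <= first
        and last <= a_last
        for a_dir, a_proto, a_first, a_last in allowed_rules
    )


def missing_rules(allowed_rules):
    """Return the required rules that no allowed rule covers."""
    return [rule for rule in REQUIRED_RULES if not is_covered(rule, allowed_rules)]
```

A required range that is covered only by two adjacent allowed rules is reported as missing, so treat the result as a prompt for manual review.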

Related information

IP address and port requirements for WorkSpaces

Turn on self-service WorkSpace management capabilities for your users

Required configuration and service components for WorkSpaces


Amazon WorkSpaces client network requirements

Your WorkSpaces users can connect to their WorkSpaces by using the client application for a supported device. Alternatively, they can use a web browser to connect to WorkSpaces that support this form of access. For a list of WorkSpaces that support web browser access, see "Which Amazon WorkSpaces bundles support web access?" in Client Access, Web Access, and User Experience.

A web browser cannot be used to connect to Amazon Linux WorkSpaces.

Beginning October 1, 2020, customers will no longer be able to use the Amazon WorkSpaces Web Access client to connect to Windows 7 custom WorkSpaces or to Windows 7 Bring Your Own License (BYOL) WorkSpaces.

To provide your users with a good experience with their WorkSpaces, verify that their client devices meet the following network requirements:

The client device must have a broadband internet connection. We recommend planning for a minimum of 1 Mbps per simultaneous user watching a 480p video window. Depending on your user-quality requirements for video resolution, more bandwidth might be required.

The network that the client device is connected to, and any firewall on the client device, must have certain ports open to the IP address ranges for various Amazon services. For more information, see IP address and port requirements for WorkSpaces.

For the best performance for PCoIP, the round trip time (RTT) from the client's network to the Region that the WorkSpaces are in should be less than 100ms. If the RTT is between 100ms and 200ms, the user can access the WorkSpace, but performance is affected. If the RTT is between 200ms and 375ms, the performance is degraded. If the RTT exceeds 375ms, the WorkSpaces client connection is terminated.

For the best performance for WorkSpaces Streaming Protocol (WSP), the RTT from the client's network to the Region that the WorkSpaces are in should be less than 250ms. If the RTT is between 250ms and 400ms, the user can access the WorkSpace, but the performance is degraded.
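
The RTT thresholds for both protocols can be summarized in a small lookup function. This is a minimal sketch with a hypothetical function name and labels. How exact boundary values such as 100 ms or 250 ms are handled is an assumption, and the text does not say what happens for WSP above 400 ms, so that case is returned as "unspecified".

```python
def classify_rtt(protocol, rtt_ms):
    """Map a measured round-trip time in milliseconds to the expected experience."""
    protocol = protocol.upper()
    if protocol == "PCOIP":
        if rtt_ms < 100:
            return "best"
        if rtt_ms <= 200:
            return "affected"
        if rtt_ms <= 375:
            return "degraded"
        return "terminated"
    if protocol == "WSP":
        if rtt_ms < 250:
            return "best"
        if rtt_ms <= 400:
            return "degraded"
        return "unspecified"
    raise ValueError(f"unknown protocol: {protocol}")
```

For example, an RTT of 300 ms is "degraded" for both PCoIP and WSP, while 150 ms is "affected" for PCoIP but still "best" for WSP.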

To check the RTT to the various Amazon Regions from your location, use the Amazon WorkSpaces Connection Health Check.

To use webcams with WSP, we recommend a minimum upload bandwidth of 1.7 megabits per second.

If users will access their WorkSpaces through a virtual private network (VPN), the connection must support a maximum transmission unit (MTU) of at least 1200 bytes.

You cannot access WorkSpaces through a VPN connected to your virtual private cloud (VPC). To access WorkSpaces using a VPN, internet connectivity (through the VPN's public IP addresses) is required, as described in IP address and port requirements for WorkSpaces.

The clients require HTTPS access to WorkSpaces resources hosted by the service and Amazon Simple Storage Service (Amazon S3). The clients do not support proxy redirection at the application level. HTTPS access is required so that users can successfully complete registration and access their WorkSpaces.

To allow access from PCoIP zero client devices, you must be using a PCoIP protocol bundle for WorkSpaces. You must also enable Network Time Protocol (NTP) in Teradici. For more information, see Set up PCoIP zero clients for WorkSpaces.

For 3.0+ clients, if you are using single sign-on (SSO) for Amazon WorkDocs, you must follow the instructions in Single Sign-On in the Amazon Directory Service Administration Guide.

You can verify that a client device meets the networking requirements as follows.

Open your WorkSpaces client. If this is the first time you have opened the client, you are prompted to enter the registration code that you received in the invitation email.

Depending on which client you're using, do one of the following.

The client application tests the network connection, ports, and round-trip time, and reports the results of these tests.

Close the Network dialog box to return to the sign-in page.

Choose Network in the lower-right corner of the client application. The client application tests the network connection, ports, and round-trip time, and reports the results of these tests.

Choose Dismiss to return to the sign-in page.

AWS - How to handle global "Round Trip Time"?

Hey serverfault people,

Imagine a generic "Software as a Service" company offering a service running on AWS (hey, that's us). There is no rocket science involved: a standard web application doing its thing as usual, plus an end-user smartphone app. As customers are from Europe, naturally the AWS eu-central-1 region contains everything for multiple tenants.

Now Sales manages to win a customer from Australia - all good so far, as the web application can already handle different timezones, currencies and locales. But: Australia is about as far away as you can get from Europe (at least on earth), and so quite some round trip time is now involved. Per request we see roughly 300ms - 400ms extra per direction (EDIT: this is wrong when speaking about RTT, as pointed out in the comments; we see 2x400ms = 800ms extra for the first HTTPS request).

For the mentioned web application, which is used by the customer for management purposes, it's totally fine. The rendered HTML arrives a bit later, but thanks to CDNs (CloudFront), assets are not an issue.

But the end-user smartphone application, which makes smaller but more frequent JSON requests, is affected. There it feels at the edge of "OK-ish" but definitely not snappy.

Now the question is: how to improve the timings from an end-user perspective? We already thought about a few options here:

1. Clone the complete software and host it in AWS ap-southeast-2 as well

Benefit: awesome performance, easy to set up, CI/CD would allow deploying the same code simultaneously in EU and AU.

Drawbacks: we have to maintain and pay for two identical infrastructure sets, data cannot be shared easily, lots of duplication in all respects.

2. Move only computation instances to AWS ap-southeast-2

Nope, will not work, as database or Redis queries would be affected by the round trip time even more.

3. Have a read-only replica in AWS ap-southeast-2 and do writes in eu-central-1

Better than option 2, but adds a lot of complexity in the code, plus the number of writes is usually not that small.

4. Spin up a load balancer in AWS ap-southeast-2 and peer connect the VPCs

Idea: users connect to the AU endpoint and traffic goes via a beefy connection to the EU instances. However, this would obviously not reduce the distance and we are unsure about the potential improvement (if any?)
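
To make the read-replica option more concrete, here is a minimal sketch of the routing it would need in application code. The class name and the endpoint hostnames are made up for illustration, and deciding by the first SQL keyword is a simplification: statements such as `SELECT ... FOR UPDATE` would need to go to the primary as well.

```python
class RegionRouter:
    """Send plain reads to the nearby replica and everything else to the primary."""

    def __init__(self, primary, replica):
        self.primary = primary  # writable database endpoint, e.g. in eu-central-1
        self.replica = replica  # read-only replica endpoint, e.g. in ap-southeast-2

    def endpoint_for(self, sql):
        """Return the endpoint that should run the given SQL statement."""
        words = sql.split(None, 1)
        first_word = words[0].lower() if words else ""
        return self.replica if first_word == "select" else self.primary


# Hypothetical hostnames, for illustration only.
router = RegionRouter(
    primary="db-primary.eu.example.internal",
    replica="db-replica.au.example.internal",
)
```

Replicas lag behind the primary, so a read issued right after a write may need to go to the primary too. That is part of the extra complexity this option brings.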

Has anybody experienced a similar issue and is willing to share some insights?

Update: it seems that only the first HTTPS request is very slow. While digging into AWS Load Balancer options, I also noticed that AWS Global Accelerator might help, so we did some tests.

From local system (in EU):

From AU (EC2 instance):

From AU to AWS Global Accelerator(EC2 instance):

In a nutshell: It seems the TLS handshake is causing the biggest initial latency. If it can be reused however, the extra time for AU to EU seems really "just" ~277ms (0,294524s - 0,017285s) for Time To First Byte.
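
A rough model shows why the first request costs so much more than later ones. This is a simplified sketch, not a measurement: it assumes the ~277ms above approximates one AU to EU round trip, counts one round trip for the TCP handshake, two for a full TLS 1.2 handshake or one for TLS 1.3, and one for the HTTP request itself. It ignores DNS lookups and server processing time, and the function name is made up.

```python
# Round trips spent on a full TLS handshake before the request can be sent.
TLS_HANDSHAKE_RTTS = {"tls1.2": 2, "tls1.3": 1}


def first_byte_ms(rtt_ms, tls="tls1.2", reused=False):
    """Estimate time to first byte from round trips alone."""
    if reused:
        # Connection already open: only the request/response round trip remains.
        return rtt_ms
    # TCP handshake + TLS handshake + request/response.
    return (1 + TLS_HANDSHAKE_RTTS[tls] + 1) * rtt_ms


# Difference between the two Time To First Byte values quoted above, in ms.
extra_ms = round((0.294524 - 0.017285) * 1000)
```

With `extra_ms` at 277, the model gives 1108ms for a new TLS 1.2 connection, 831ms for a new TLS 1.3 connection, and 277ms for a reused one, so keeping connections alive matters more than anything else here.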

  • Regarding 300ms - 400ms extra per direction, that sounds strange. I would expect the full RTT to be in that range (well, I see 250-300ms RTT to Sydney hosts, but depending on where in Australia it will obviously vary... but not double as you indicated). Regarding option 4, if this is about the latency it will not really matter much (while the routing will be slightly different, most of that distance is inherent, and as you noted it's really the distance that adds to the latency). – Håkan Lindqvist Commented Jul 9, 2021 at 17:22
  • To reduce latency you need application and database in Sydney. I like #3, alter your application to use a read replica for reads and send writes to the master EU database, so long as it will actually have benefits. Otherwise you'll need the full stack in Sydney. – Tim Commented Jul 9, 2021 at 22:00
  • @HåkanLindqvist you are absolutely right! I measured a full HTTPS request and divided it by 2, that's not the RTT. – Markus Commented Jul 11, 2021 at 12:52
  • The too many writes part may well be insignificant compared to modern browsers' ability to shave off round trips. You may want to measure HTTP/1.1, HTTP/2, HTTP/3, 0-RTT & full-handshake separately to confirm that you really do need the database closer to your users, as opposed to, say, waiting for old smartphones and MSIE to get replaced. – anx Commented Jul 11, 2021 at 13:16

