
I have had a SageMaker instance running for a while now. I didn't change anything in the meantime, but now I can't see new logs in CloudWatch anymore. The old logs are still there, but no new ones have appeared for two days.

The SageMaker instance is still running; it's just not logging anymore. Since the code didn't change and there is nothing time-dependent in it, I'm pretty sure I hit a limit. But I don't know which one:

  • The Log group has only one log stream
  • The single log stream has a size of 175MB.

I found CloudWatch Logs Limits and CloudWatch Events Limits, but that didn't help me.

What could be the problem? How can I investigate it?

According to the AWS docs, this should not happen. General AWS support did not help.
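One way to narrow this down is to check whether CloudWatch is still receiving *any* events for the log group, and when each stream last received one. Below is a minimal sketch using boto3; the log group name is an assumption (SageMaker typically writes to groups such as `/aws/sagemaker/TrainingJobs` or `/aws/sagemaker/Endpoints/<endpoint-name>`), and all other names are illustrative.

```python
from datetime import datetime, timezone


def last_event_times(log_group_name):
    """Return (stream name, last event time) pairs for a CloudWatch log group."""
    import boto3  # imported here so the sketch reads standalone

    client = boto3.client("logs")
    paginator = client.get_paginator("describe_log_streams")
    results = []
    for page in paginator.paginate(logGroupName=log_group_name,
                                   orderBy="LastEventTime",
                                   descending=True):
        for stream in page["logStreams"]:
            ts = stream.get("lastEventTimestamp")  # milliseconds since epoch
            when = (datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
                    if ts is not None else None)
            results.append((stream["logStreamName"], when))
    return results
```

Calling e.g. `last_event_times("/aws/sagemaker/TrainingJobs")` (assumed group name) would show whether the existing stream simply stopped receiving events or whether a new stream appeared that you haven't noticed.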

  • I have not worked with SageMaker, but I can still give you some pointers that should help debug this. I assume you can get shell access to the underlying EC2 machine. Before starting, see docs.aws.amazon.com/AmazonCloudWatch/latest/logs/…. First, run sudo systemctl status awslogsd to make sure it's running. Next, make sure the policy granting access to "arn:aws:logs:*:*:*" is still active. Then run journalctl -u awslogsd to look for issues in the awslogsd logs. If nothing turns up, run journalctl -f and watch the live log output. Commented Apr 23, 2018 at 14:09
  • I don't think I can log in to SageMaker with a shell... or at least I don't know how. Commented Apr 23, 2018 at 14:51
  • There is an option for S3 logs too, I believe? Also, can you check whether there is a policy issue? Commented Apr 23, 2018 at 18:30
  • I don't know how to check for a policy issue. The point is that it was running for quite a while; the change was surprising, and I don't think anything was changed on my side. Commented Apr 23, 2018 at 19:18
  • Do you have any visibility into what the log files actually look like? Are they rotated, and at what frequency? The CloudWatch Logs agent will ignore a rotated file if its first line (by default) is the same as in the previous file. Can you see what the log files look like and what the CloudWatch Logs configuration is? Commented Apr 29, 2018 at 9:19

2 Answers


First, it doesn't sound like you're doing anything wrong. Logs should just show up in CloudWatch without you having to do anything, without size or time limits. If they start at all, then we know permissions were set up properly -- unless you modified IAM in the middle of the run. If the logs stop mid job, then either the actual job stopped outputting to stdout/stderr for some reason or this is an operational glitch with the service's log processing. Contacting AWS support (here, in the AWS forums, or through tech support) is the right way to deal with this -- giving somebody in AWS the account id and job name will enable them to look into exactly what happened.

Also, sorry this has gone unanswered for so long. Judging by the activity here, it seemed like a lot of people might have hit this problem. But I'm also guessing & hoping that the problem was a temporary internal service glitch that has been resolved. If anybody is still seeing this problem (after October 2018), please leave a comment so we know it still needs attention. Or better yet open a new question (not ideal from an SO perspective, but that's more likely to get somebody's attention at AWS).

Thanks for using Amazon SageMaker, and thanks for the feedback!

-An AWS employee


2 Comments

I asked AWS support, but they were not helpful. They sent me a couple of links that basically said AWS takes care of the logging. After I mentioned that this is likely an AWS bug, they only replied that they are not technical support (which I hadn't booked). Later, I think I found the problem: I had a lot of identical log messages. Somehow this seems to have caused issues (although I could not see that I hit any limit). Adding a timestamp to each message and making the logging less verbose solved it for me (for now; I'm not sure whether it will recur).
Does it still repro? And sorry you couldn't get the help you needed at the time -- AWS forums are sometimes a better way to get attention of technical folks, but we're working on watching SO more closely.
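The workaround mentioned in the comment above (prefixing each message with a timestamp so repeated messages are no longer byte-identical) can be sketched with Python's standard logging module. All names here are illustrative, not from the original post:

```python
import io
import logging


def make_logger(stream):
    """Build a logger whose formatter prefixes every message with a timestamp."""
    logger = logging.getLogger("sagemaker-job")  # illustrative name
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    # %(asctime)s distinguishes otherwise-identical messages from line to line
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.handlers = [handler]
    logger.propagate = False
    return logger


buf = io.StringIO()
log = make_logger(buf)
log.info("epoch finished")  # same text twice...
log.info("epoch finished")  # ...but each emitted line carries its own timestamp
lines = buf.getvalue().splitlines()
```

This is only a sketch of the idea; whether identical consecutive messages were actually the root cause is the commenter's hypothesis, not something confirmed by AWS.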

I encountered this problem multiple times. It's possible that a new LogStream wasn't created after an endpoint update (which can be triggered by you, or by AWS restarting/updating the underlying instances). You should see a log stream for every instance that runs, or used to run, on your endpoint.

Unfortunately, the only way I found to mitigate it was to update the endpoint (apply an identical EndpointConfiguration that uses the same model, for example), basically triggering recreation of the instances and their log streams.
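The workaround above can be sketched with boto3. Note that UpdateEndpoint generally rejects the config name already attached to the endpoint, so this sketch copies the current config under a new name first; the `-refresh` suffix and all other names are assumptions, and this is not production code:

```python
def refresh_endpoint(endpoint_name):
    """Re-apply a copy of the endpoint's current config to force instance
    recreation (and, with it, fresh log streams)."""
    import boto3  # imported here so the sketch reads standalone

    sm = boto3.client("sagemaker")
    current = sm.describe_endpoint(EndpointName=endpoint_name)
    old_config = sm.describe_endpoint_config(
        EndpointConfigName=current["EndpointConfigName"])

    # Copy the production variants verbatim into a new config name,
    # since updating with the currently attached config name is rejected.
    new_name = old_config["EndpointConfigName"] + "-refresh"  # assumed suffix
    sm.create_endpoint_config(
        EndpointConfigName=new_name,
        ProductionVariants=old_config["ProductionVariants"])

    sm.update_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=new_name)
```

You would call it as `refresh_endpoint("my-endpoint")` with your own endpoint name; the endpoint stays in service while the update rolls out, but expect it to sit in the Updating state for a few minutes.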

Comments
