this post was submitted on 06 Sep 2023

1080 points (99.4% liked)

All of Japan's Toyota Assembly Plants Shut Down for a Day Because Their Server Ran Out of Disk Space (www.reuters.com)

submitted 1 year ago by [email protected] to c/[email protected]

[–] [email protected] 330 points 1 year ago (4 children)

I haven't read the article because documentation is overhead, but I'm guessing the real reason is that the guy who kept saying they needed to add more storage was repeatedly told to calm down and stop overreacting.

[–] [email protected] 169 points 1 year ago (7 children)

I used to do some freelance work years ago and I had a number of customers who operated assembly lines. I specialized in emergency database restoration, and the assembly line folks were my favorite customers. They know how much it costs them for every hour of downtime, and never balked at my rates and minimums.

The majority of the time the outages were due to failure to follow basic maintenance, and log files eating up storage space was a common culprit.

So yes, I wouldn't be surprised at all if the problem was something the local IT called out, only to be overruled for one reason or another.

[–] [email protected] 59 points 1 year ago (2 children)

and log files eating up storage space was a common culprit.

Another classic symptom of poorly maintained software. Constant announcements of trivial nonsense, like [INFO]: Sum(1, 1) - got result 2! filling up disks.

I don't know if the systems you're talking about are like this, but it wouldn't surprise me!
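For what it's worth, here is a minimal sketch of one way to stop chatty logs from eating a disk, assuming a Python service. It uses the standard library's `RotatingFileHandler` to cap how much space the log files can take and raises the level so INFO noise is dropped. The function name `make_capped_logger` and the size limits are made up for illustration, not taken from any system discussed here.

```python
import logging
import logging.handlers
import os
import tempfile


def make_capped_logger(path, max_bytes=1_000_000, backups=3, level=logging.WARNING):
    """Return a logger whose files stay under roughly max_bytes * (backups + 1)."""
    logger = logging.getLogger(f"capped:{path}")
    logger.setLevel(level)      # drops INFO chatter like "Sum(1, 1) - got result 2!"
    logger.propagate = False    # keep records out of the root logger's handlers
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("[%(levelname)s]: %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    log = make_capped_logger(os.path.join(log_dir, "app.log"), max_bytes=2_000, backups=2)
    for i in range(1_000):
        log.info("Sum(1, 1) - got result 2!")   # filtered out by the WARNING level
        log.warning("disk check %d failed", i)  # kept, but rotated once the file fills
    print(sorted(os.listdir(log_dir)))
```

Once the current file would exceed `max_bytes`, it is rolled over and the oldest backup is discarded, so the worst case on disk is bounded no matter how long the service runs.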

[–] [email protected] 38 points 1 year ago (1 children)

You gotta forward that to Spunk so your logs ain't filling up the server generating them. Plus you can set up automated alerts for when the result stops being 2.

This message brought to you by Big Splunk.

[–] [email protected] 25 points 1 year ago (2 children)

I think you missed a letter...

[–] [email protected] 20 points 1 year ago (1 children)

I always make sure my logs are covered by Spunk.

[–] [email protected] 23 points 1 year ago (1 children)

And yet that’s probably there because sometime, somewhere, it returned 1.9 or 2.00001 or some such nonsense.

[–] [email protected] 76 points 1 year ago (5 children)

I'm this person in my organization. I sent an email up the chain warning folks we were going to eventually run out of space about 2 years ago.

Guess what just recently happened?

ShockedPikachuFace.gif

[–] [email protected] 25 points 1 year ago

You got approval for new SSDs because the manglement recognised the threat you identified as critical?

Right?

[–] [email protected] 19 points 1 year ago

Literally sent that email this morning. It's not that we don't have the space, it's that I can't get a maintenance window to migrate the data to the new storage platform.

[–] [email protected] 27 points 1 year ago (2 children)

Ballast!

Just plonk a large file in the storage, make it relative to however much is normally used in the span of a work week or so. Then when shit hits the fan, delete the ballast and you'll suddenly have bought a week to "find" and implement a solution. You'll be hailed as a hero, rather than be the annoying doomer that just bothers people about technical stuff that's irrelevant to the here and now.

[–] [email protected] 16 points 1 year ago (2 children)

Or you could be fired because technically you're the one that caused the outage.

[–] [email protected] 175 points 1 year ago (6 children)

Sysadmin pro tip: Keep a 1-10GB file of random data named DELETEME on your data drives. Then if this happens you can get some quick breathing room to fix things.

Also, set up alerts for disk space.
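A rough sketch of both tips in Python, in case it helps. The function names, the 15% warning threshold, and the chunk size are my own assumptions, not anything from the article. The ballast is filled with random bytes so that compression, deduplication, or sparse-file handling cannot quietly shrink it.

```python
import os
import shutil


def create_ballast(path, size_bytes, chunk=1024 * 1024):
    """Write size_bytes of random data to path so the space is really allocated."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the disk


def free_fraction(mount_point):
    """Fraction of the filesystem holding mount_point that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(mount_point)
    return usage.free / usage.total


def disk_alert(mount_point, warn_below=0.15):
    """Return a warning string when free space drops below warn_below, else None."""
    frac = free_fraction(mount_point)
    if frac < warn_below:
        return f"WARNING: {mount_point} has only {frac:.1%} free"
    return None
```

Something like `create_ballast("/data/DELETEME", 5 * 1024**3)` would reserve 5 GB (the path is hypothetical), and the string from `disk_alert` can be fed to whatever paging or chat tool you already use.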

[–] [email protected] 54 points 1 year ago

The real pro tip is to segregate the core system and anything on your system that eats up disk space into separate partitions, along with alerting, log rotation, etc. And also to not have a single point of failure in general. Hard to say exactly what went wrong w/ Toyota but they probably could have planned better for it in a general way.

[–] [email protected] 31 points 1 year ago* (last edited 1 year ago) (3 children)

Even better: run a cron job every 5 mins, and if total remaining space falls to 5%, auto delete the file and send a message to the sysadmin.
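Roughly what that check could look like as a script, assuming Python is available on the box. The function name `release_ballast_if_low` and the `notify` hook are hypothetical; swap in whatever actually reaches your sysadmin (mail, pager, chat webhook).

```python
import os
import shutil


def release_ballast_if_low(mount_point, ballast_path, threshold=0.05, notify=print):
    """Delete the ballast file and notify when free space is at or below threshold.

    Returns True if the ballast was deleted on this run, False otherwise.
    """
    usage = shutil.disk_usage(mount_point)
    free = usage.free / usage.total
    if free > threshold:
        return False  # plenty of room, nothing to do
    if not os.path.exists(ballast_path):
        # Already used our one free pass; a human has to step in now.
        notify(f"{mount_point}: {free:.1%} free and no ballast left to delete")
        return False
    reclaimed = os.path.getsize(ballast_path)
    os.remove(ballast_path)
    notify(f"{mount_point}: {free:.1%} free, deleted ballast ({reclaimed} bytes)")
    return True
```

A crontab line such as `*/5 * * * * /usr/bin/python3 /opt/scripts/ballast_check.py` would run it every 5 minutes; both paths are placeholders, and the script would need to call the function with your real mount point and ballast path.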

[–] [email protected] 21 points 1 year ago

Sends a message and gets the services ready for potential shutdown. Or implements a rate limit to keep the service available but degraded.

[–] [email protected] 29 points 1 year ago* (last edited 1 year ago) (3 children)

10GB is nothing in an enterprise datastore housing PBs of data. 10GB is nothing for my 80TB homelab!

[–] [email protected] 28 points 1 year ago (1 children)

It's not going to bring the service online, but it will keep a full disk from stopping you from doing other things. In some cases SSH won’t work with a full disk.

[–] [email protected] 29 points 1 year ago (2 children)

It’s all fun and games until tab autocomplete stops working because of disk space

[–] [email protected] 97 points 1 year ago (1 children)

This happens. Recently we had a problem in production where our database grew by a factor of 10 in just a few minutes due to a replication glitch. Of course it took down the whole application as we ran out of space.

Some things just happen, and all the headroom and monitoring in the world cannot save you if things go seriously wrong. You cannot prepare for everything in life and IT I guess. It is part of the job.

[–] [email protected] 22 points 1 year ago (2 children)

Bad things can happen but that's why you build disaster recovery into the infrastructure. Especially with a company as big as Toyota, you can't have a single point of failure like this. They produce over 13,000 cars per day. This failure cost them close to 300,000,000 dollars just in cars.
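A quick sanity check on what those two figures imply per car, assuming exactly one day of output was lost (the article only says "about a day") and taking both numbers from this comment at face value:

```python
def implied_value_per_car(total_loss_usd, cars_lost):
    """Average value per car implied by a total loss and a number of cars not built."""
    return total_loss_usd / cars_lost


if __name__ == "__main__":
    per_car = implied_value_per_car(300_000_000, 13_000)
    print(f"Implied value per car: ${per_car:,.0f}")  # prints "Implied value per car: $23,077"
```

So the estimate works out to roughly 23,000 dollars per vehicle.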

[–] [email protected] 64 points 1 year ago (2 children)

There's some irony to every tech company modeling their pipeline off Toyota's Kanban system...

Only for Toyota to completely fuck up their tech by running out of disk space for their system to exist on. Looks like someone should have put "Buy more hard drives" on the board.

[–] [email protected] 23 points 1 year ago (3 children)

Not to mention the lean process effed them during Fukushima and COVID: a breakdown in logistics and a shortage of chips meant that their entire mode of operating shut down, as they had no capacity to deal with outages in any of their systems. Maybe that has happened again, just in server land.

[–] [email protected] 29 points 1 year ago (1 children)

Toyota was the carmaker best positioned for the COVID chip shortage because they recognized it as a bottleneck. They were pumping out cars a few months longer than the others (even if they eventually hit the same wall everyone else did).

[–] [email protected] 18 points 1 year ago

It was forever ignored in the backlog

[–] [email protected] 63 points 1 year ago (2 children)

I blame lean philosophy. Keeping spare parts and redundancy is expensive so definitely don't do it...which is just rolling the dice until it comes up snake eyes and your plant shuts down.

It's the "save 5% yearly and stop trying to avoid a daily 5% chance of disaster" mindset.

Over prepared is silly, but so is under prepared.

They were under prepared.

[–] [email protected] 50 points 1 year ago (5 children)

Lean philosophy is supposed to account for those dice-rolling moments. It's not just "keep nothing in inventory", there is supposed to be risk assessment involved.

The problem is that leadership doesn't interpret it that way and just sees "minimizing inventory increases profit!"

[–] [email protected] 28 points 1 year ago (4 children)

I work in a manufacturing company that was owned by the founder for 50 years until about 4 years ago when he retired. He disagreed with a lot of the ideas behind lean manufacturing so we had like 5 years worth of inventory sitting in our warehouse.

When the new management came in, there was a lot of squawking about inefficiency, how wasteful it was to keep so much raw material on the shelf, and how we absolutely needed to sell it off or get rid of it.

Then a funny little thing happened in 2020.

Suddenly, we were the only company in our industry still churning out product. Other companies were calling us, desperate to buy our products or even just our raw material. We saw MASSIVE growth the next two years and came out of the pandemic better than ever. And it was mostly thanks to the old owner's view that "Just In Time" manufacturing was BS.

[–] [email protected] 37 points 1 year ago (2 children)

This is the best summary I could come up with:

TOKYO, Sept 6 (Reuters) - A malfunction that shut down all of Toyota Motor's (7203.T) assembly plants in Japan for about a day last week occurred because some servers used to process parts orders became unavailable after maintenance procedures, the company said.

The system halt followed an error due to insufficient disk space on some of the servers and was not caused by a cyberattack, the world's largest automaker by sales said in a statement on Wednesday.

"The system was restored after the data was transferred to a server with a larger capacity," Toyota said.

The issue occurred following regular maintenance work on the servers, the company said, adding that it would review its maintenance procedures.

Two people with knowledge of the matter had told Reuters the malfunction occurred during an update of the automaker's parts ordering system.

Toyota restarted operations at its assembly plants in its home market on Wednesday last week, a day after the malfunction occurred.

The original article contains 159 words, the summary contains 159 words. Saved 0%. I'm a bot and I'm open source!

[–] [email protected] 40 points 1 year ago (5 children)

Lol good bot I guess

[–] dabster291 15 points 1 year ago

Wow, what a useful bot!

[–] [email protected] 33 points 1 year ago (3 children)

Idiots, they ought to have switched to tabs for indenting. Everybody knows that.

[–] [email protected] 32 points 1 year ago

This is a fun read in the wake of learning about all the personal data car manufacturers have been collecting

[–] [email protected] 25 points 1 year ago

Free disk space is just inventory and therefore wasteful.

[–] [email protected] 22 points 1 year ago (1 children)

Was this that full shutdown everyone thought was going to be malware?

The worst malware of all, unsupervised junior sysadmins.

[–] [email protected] 13 points 1 year ago

Human error....lol, classic.

[–] [email protected] 22 points 1 year ago

Kanban

[–] [email protected] 14 points 1 year ago (1 children)

Just delete some p0rn
