Downtime / site status page?

Recommend Site Improvements

1rsterling
Aug 6, 2009, 1:17am

Just a thought: is there any way to have a site status page hosted on a different server or something so there's a way to find out what's going on when there are down times? I used to check the blog when the rest of the site wouldn't come up, at least to see if it's going to be a big one, but now the blog seems to be inaccessible too when the rest of the site goes down.

2conceptDawg
Aug 6, 2009, 1:27am

Tonight's outage wasn't the servers. Our servers were just fine. It was the entire data center that went down. So having another, dedicated status machine at our data center wouldn't have done much good.

That being said, we've talked about having a separate status machine located somewhere other than our data center. We just haven't had the resources (human) to get it done quite yet.

3timspalding
Aug 6, 2009, 1:28am

If we had a separate server, people would need to know how to get to it. I think my Twitter account pretty much serves the purpose...

http://twitter.com/librarythingtim

T

4rsterling
Aug 6, 2009, 1:59am

Thanks for the response. Twitter's a good idea - bookmarking it...

5felius
Aug 6, 2009, 2:02am

You could try mine, too, seeing as I post more about outages than Tim does, and I get paged when something breaks. ;)

6jjmcgaffey
Aug 6, 2009, 5:23am

Yes - I saw your tweets (both of you) about the outage today. Fortunately it didn't go down until after I'd done a shelf of my mom's books to lure her into using LT...

The (repeated) questions as to whether Boston was a smoking hole in the ground were...almost amusing. Would have been entirely so if I was quite certain the answer was 'no'!

7BTRIPP
Aug 6, 2009, 7:42am

LiveJournal used to (it may still) have a status page hosted on Warped.com that if LJ went down would somehow get the hits (and let folks know there was a problem) ... I'm not sure exactly how that worked, but something like that could be an option.

8felius
Aug 6, 2009, 8:26am

(gets a bit technical, apologies in advance!)

I definitely want a status page hosted somewhere other than the main web server, and I want to track availability at a much more granular level. Now that the migration to the new colo has happened we have enough hardware to handle outages in much smarter ways, and there are lots of things on my TODO list that are all about improving reliability and availability. This will include being able to show a status/outage page even if a web server dies (though preferably we'll keep going without you noticing, while I get paged to fix it!).

Our uptime *has* been much better since the move - we almost hit 100% in July until something broke right near the end of the month. :/ We've had a couple of short outages since then that were due to changes in the code causing excess load on the DB servers.

This case was pretty unusual though - our network connectivity is backed by multiple backbone providers, and traffic should be transparently shifted between them in the event of a failure. What happened was that our hosting provider had a power failure to the network core itself, which cut off connectivity to their two datacenters in Boston. (Actually I suspect it was a drastic equipment failure - they said "The power event was limited to DC power systems that provide power to the Internap PNAP and not customer UPS systems." Still waiting to hear the full story..)

In order to deal with that we'd need to be able to dynamically re-route an IP address to a server hosted in another facility. That's technically possible, but not something we're set up to do (and I suspect not something that's high on the priority list at the moment.)

9LibraryThingLuke
Aug 6, 2009, 9:29am

As far as a place for status messages goes, Twitter is down way more than LT. Maybe we should host one for them. Or, there's always http://downforeveryoneorjustme.com/

10felius
Aug 6, 2009, 10:00am

How is it that I've forgotten to mention (until now) that we actually *do* have an external status page? At present it only indicates whether or not the main web server is up, so it doesn't count outages where you can see a "LibraryThing is down" page. I'm planning to add better reports here before I do anything more fancy, though.

11BTRIPP
Aug 6, 2009, 10:05am

Oddly enough ... LiveJournal (along with Twitter, for some reason) is down this morning, so I got a chance to actually check out http://status.livejournal.com/ ... which is still out there and still on Warped.com ... this at least gives a place for the organization to communicate info about the downtime to its users!

12BTRIPP
Aug 6, 2009, 10:09am

Hey ... that http://downforeveryoneorjustme.com/ is cool! I guess it's "not just me" that's having problems with both LiveJournal and Twitter this morning ... of course, I'm going nuts now since the only place I can go and blither about it is FaceBook :-(

13PhaedraB
Aug 6, 2009, 11:19am

> 12

I feel your pain (sort of). FB was acting wonky for me this morning. Luckily, most of my blithering is on another site, with which y'all may be familiar ...

14timspalding
Aug 6, 2009, 2:41pm

The problem with our status page is that "up" to you is a technical term. It means the server is responding. To you, if the server is delivering a "down" page, we're up.

Can Pingdom calculate the "real" up?

15MarthaJeanne
Edited: Aug 6, 2009, 3:00pm

Ah, yes, the old, 'yes the program is running, it's just not responding to users' problem.

16infiniteletters
Aug 6, 2009, 3:30pm

14: If the server is delivering a down web page, then we don't need to check an external page...

17timspalding
Aug 6, 2009, 4:16pm

>16

Yes, but I'd love to know "how we're doing" over time.

18felius
Aug 6, 2009, 7:57pm

The problem with our status page is that "up" to you is a technical term. It means the server is responding. To you, if the server is delivering a "down" page, we're up.


No, "up" means the same thing to me as it does to everybody else. I just track a lot more things that can be "up" or "down". However I agree that most people care whether or not "LibraryThing" is up, rather than the status of individual components.


Can Pingdom calculate the "real" up?


Yes. We just tell them to request scripts which return a tiny block of XML indicating the status of whatever service we're tracking. We just need to write the scripts!
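
A minimal sketch of what one of those scripts might look like, not the actual LibraryThing code. The element names follow the format Pingdom's HTTP custom check is generally documented to expect (worth verifying against their docs before relying on it); `render_status`, `check_service` and the `probe` callable are hypothetical names made up for illustration.

```python
import time
from typing import Callable
from xml.sax.saxutils import escape


def render_status(ok: bool, elapsed_ms: float) -> str:
    """Render the tiny XML block an external monitor would fetch."""
    status = "OK" if ok else "DOWN"
    return (
        "<pingdom_http_custom_check>"
        f"<status>{escape(status)}</status>"
        f"<response_time>{elapsed_ms:.3f}</response_time>"
        "</pingdom_http_custom_check>"
    )


def check_service(probe: Callable[[], bool]) -> str:
    """Run one service probe, time it, and report the result as XML.

    `probe` is any zero-argument callable returning True when the
    service is healthy (for example, a test query against a DB server).
    An exception from the probe is treated as the service being down.
    """
    start = time.perf_counter()
    try:
        ok = bool(probe())
    except Exception:
        ok = False
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return render_status(ok, elapsed_ms)
```

Each tracked service would get its own probe, and the web server would return the output of `check_service` at a URL the monitor requests on a schedule.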

19timspalding
Aug 7, 2009, 2:46am

Cool. So, let's focus on something that combines when we're unreachable (down down) and when we're "down" down.
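
One rough way to combine the two kinds of outage into a single "how we're doing" number, sketched under assumptions: each sample is the body of a fetched page, or `None` when the site could not be reached at all, and the outage page is assumed to contain the text "LibraryThing is down". `State`, `classify` and `uptime_percent` are hypothetical names, not anything LibraryThing or Pingdom provides.

```python
from enum import Enum
from typing import Iterable, Optional


class State(Enum):
    UP = "up"
    DOWN_PAGE = "down page"      # server answers, but with the outage page
    UNREACHABLE = "unreachable"  # no answer at all


def classify(body: Optional[str],
             down_marker: str = "LibraryThing is down") -> State:
    """Classify one probe result from the user's point of view."""
    if body is None:
        return State.UNREACHABLE
    if down_marker in body:
        return State.DOWN_PAGE
    return State.UP


def uptime_percent(states: Iterable[State]) -> float:
    """Percentage of samples where the site was really up.

    Both DOWN_PAGE and UNREACHABLE count against uptime.
    """
    samples = list(states)
    if not samples:
        raise ValueError("no samples to summarise")
    up = sum(1 for s in samples if s is State.UP)
    return 100.0 * up / len(samples)
```

Keeping the two failure states separate means the same samples can still answer the narrower question of whether the server itself was responding.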

20justjim
Aug 7, 2009, 3:04am

Can't you just phone everybody? Or accept collect calls from everybody who wants to know what's going on.

//runs and hides//

21timspalding
Aug 7, 2009, 3:14am

Dude, just come over. If the site's down, I'll be up and we can have a beer.

22justjim
Aug 7, 2009, 3:22am

Mate, if any outage lasts long enough for me to get from here to there, your site is toast!

23infiniteletters
Aug 10, 2009, 9:02am

22: Oh, just take the orbital flight. ;)

24reading_fox
Edited: Aug 10, 2009, 11:00am

#22 you live 4 days (the length of the longest downtime I remember back in summer '07?) away from Maine? That's truly remote, only a tiny percentage of the world is 4 days away from "civilisation".

And yes it was stressful for us poor users unable to access our daily fix.

25LucindaLibri
Dec 1, 2009, 9:41pm

Tonight I hauled out a huge pile of books to add . . . got through a handful and then the LT site went down. Now it seems to be back up, but every library I try to add from gives the "Don't Panic" message and when I try to follow the link to "Search Other Libraries" that doesn't work either . . . none of the Twitter or other links mentioned above seems to mention the problem . . .
Thus, I renew the request for some sort of status feed we can subscribe to in order to assess "What's up with LT?"
I will now leave my pile of books and go drink some decaf.

26felius
Dec 1, 2009, 10:17pm

>25 Ugh. Sorry about that. The outage was my fault - I killed a server by exhausting all physical RAM, taking most of our in-memory cache and the library search services with it.

I had it back up fairly quickly, but failed to notice that some of the services on the search side didn't come back on boot. I've made changes to the configuration of that server to prevent that happening again.

You're absolutely right though, we need better monitoring and better communication of current status. The ball is in my court.
