I found these jobs useful for my buddies searching for ITSM jobs in the United States. ✅ 🔥 Please like this post to keep me motivated to share more jobs! 💪💼
Job Listings: 15 March
➤ MetroNational, Houston, TX is hiring IT Support Technician – Apply Here
➤ CAI, United States is hiring Senior ITSM Consultant – Apply Here
➤ Ahold Delhaize USA, Chicago, IL is hiring Director II – IT Service Operations – Apply Here
➤ Jobs via Dice, United States is hiring Senior ITSM Consultant/ServiceNow – Apply Here
➤ Prosum, Chicago, IL is hiring ServiceNow Engineer – CMDB – Apply Here
➤ Atlassian, United States is hiring Solutions Engineer, ITSM West – Apply Here
➤ iConsultera, United States is hiring Information Technology Support Specialist – Apply Here
➤ Motion Recruitment, United States is hiring Information Technology Technical Support – Apply Here
🔍 Explore All Related ITSM Jobs Below! 🚀 ✅ Select your preferred “Job Category” in the Job Category Filter 🎯 🔎 Hit “Search” to find matching jobs 🔥 ➕ Click the “+” icon that appears just before the company name to see the Job Detail & Apply Link 📝💼
Job Id
Date Posted
MM/YY
Company
Type
Top Company
Startup
Consultancy?
Industry Type
Location
Country
Estd Year
Role
Role Category
part time
Job Category
Tools
Exp.
Salary
Salary
Detail
Apply Link
Contact
HR Contact ?
Walkin
urgent
Career Gap
Female Preferred
Submit CV
Remote Jobs
Freelancer
TotalJobs
Certification Summ..
Masters
Trainer
TopCollege
ProdDev
InternExp
Onsite
research
certification
6
12/12/2024
Randstad USA
Top / Job Consultancy
Top Company
Job Consultancy
HR / Staffing
New York, NY
United States
IT Incident Manager
Unknown
Incident Mgmt.
ITIL
Pls See The Job Detail
Not disclosed
Other
IT Incident Manager
Start: 2 weeks from date of offer
Location: New York, NY
Schedule: First 1-2 months onsite and then transition to onsite Mon-Wed w/ Thur and Friday remote.
Temp to Perm position: 6 months before conversion
Hourly Pay for contract: 50-70 p/hr on W2 Only
*Background and Drug Screen Required*
My client is looking to add an IT Incident Manager to their team. The role primarily oversees the Service Management / Service Desk department. ServiceNow is the ticketing system used. This is a coaching/mentoring/daily management role over a 3rd party Service Desk company.
This person will also ideally be the backup Change Management resource behind the primary Change Manager, if and when necessary.
Musts:
ITIL
Onsite in NY, NY
Experience with ServiceNow
Incident Response / Incident Management
Plus:
Any management experience in Infrastructure
1
Tools: ITIL
No
19
12/12/2024
Allstate
Top
Top Company
BFSI / Fintech / NBFC
United States
United States
Manager, Incident Handling (Remote - Home Based Worker)
Unknown
Incident Mgmt.
AWS
Pls See The Job Detail
Not disclosed
Other
At Allstate, great things happen when our people work together to protect families and their belongings from life's uncertainties. And for more than 90 years our innovative drive has kept us a step ahead of our customers' evolving needs. From advocating for seat belts, air bags and graduated driving laws, to being an industry leader in pricing sophistication, telematics, and, more recently, device and identity protection.
Job Description
Job Summary
The Incident Handling Manager of the Global Security Fusion Center (GSFC) will oversee the operations of Incident Handling such as incident response, threat detection and mitigation efforts. This individual will effectively run the operations by coordinating with different service areas across GSFC and maintain the overall security posture of the organization. This leadership role will effectively communicate with stakeholders and senior leadership team especially during a major incident. This individual serves as an incident manager during major incidents and supports the investigations carried out by the team.
The Incident Handling Manager mentors the Lead analysts and other analysts of the team and reviews the performance of the individuals. This Manager role also leads teams in the development and evaluation of programs, processes and procedures to mitigate cybersecurity risk, ensuring protection of company information and assets, and understanding and applying pertinent industry and government regulations, contracts, and requirements.
Key Responsibilities
Lead and manage Security Operations Center across different regions and shifts with primary responsibilities in security event monitoring, management, and response.
Ensure incident identification, assessment, quantification, reporting, communication, mitigation and monitoring.
Ensure compliance with SLAs, process adherence and process improvement to achieve operational objectives.
Review policies and highlight the challenges in managing SLAs.
Review standard operating procedures to ensure SOC continues to effectively meet operational requirements.
Provide team & vendor management, evaluate overall use of resources, and initiate corrective action where required for SOC.
Create reports, dashboards, and metrics and present to Sr. Mgmt.
Evaluate existing technical capabilities and systems and identify opportunities for improvement.
Oversee training and exercises to ensure SOC team proficiency. Conduct after action reviews to identify lessons learned and best practices.
Work closely with Security Leadership to identify and implement process changes, improvements, and efficiencies to ensure solid security practices.
Develop communication channels with technology owners and business to evangelize the evolving threat landscape.
Job Qualifications
Ideal candidate will have 10+ years incident handling and incident response experience.
Technical knowledge of network security, operating system security, vulnerability management, common attacker techniques and exploits, encryption, and SIEM.
Know how to lead investigations and direct incident handlers and question the investigative process being followed.
Possess experience in writing both technical incident investigation reports as well as reports for senior leadership.
Ability to manage multiple initiatives at once in addition to day-to-day operations.
Experience in managing teams of 8 or more people and providing mentorship.
Advanced incident investigation and response experience.
Moderate knowledge of Windows, Unix/Linux, and Mac operating systems.
Moderate knowledge of SIEM technologies and use case design.
Moderate knowledge of malware operations and indicators.
Moderate knowledge of network defenses such as firewalls, IDS/IPS, Packet Capture, Proxies.
Moderate experience with scripting.
Moderate knowledge of forensic techniques.
Moderate knowledge of audit requirements (PCI, HIPAA, SOX, etc.).
Education and Certifications
Bachelor's Degree preferred, but not required. May also have advanced degree.
Certifications from the list below preferred, but not required:
Certified Information Systems Security Professional (CISSP)
Certified Information Security Manager (CISM)
Certified Information Systems Auditor (CISA)
Certified Incident Handler (GCIH)
GIAC Certified Intrusion Analyst (GCIA)
Certified Ethical Hacker (CEH)
Certified Expert Penetration Tester (CEPT)
Functional Skills
Advanced understanding of information security technology.
Ability to influence others and achieve results.
Ability to think strategically, conceptually, analytically and creatively.
Advanced time management skills including ability to manage multiple projects, prioritize and organize, and create alignment and buy in from clients and direct reports.
Demonstrated clear, concise and effective oral and written communication skills.
Ability to establish, manage and leverage relationships with internal and external partners.
Ability to analyze data and apply it to complex problem resolution.
Advanced understanding of security trends in the industry.
Advanced understanding of expense and resource management processes as they relate to project resources and expenses, ability to make appropriate sourcing decisions based on project resource needs, demonstrate understanding of area's budget and its relationship to department level information and external customers, and assist in explaining impact of project expenses on the area's overall expense plan.
Possess a mastery level of thriving in change by setting an example, providing leadership, and coaching peers and team members, during changes that may disrupt productivity.
Handle ambiguity and uncertainty with maturity, adapt management style to current situations and people needs, and demonstrate the ability to make decisions under time pressures and with limited information.
Skills
Change Management, Information Security Management, Information Security Operations, Network Security, People Leadership, Security Incident Response, Security Information and Event Management (SIEM)
Compensation
Compensation offered for this role is $104,000 - 187,625 annually and is based on experience and qualifications.
The candidate(s) offered this position will be required to submit to a background investigation, which includes a drug screen.
Joining our team isn't just a job - it's an opportunity. One that takes your skills and pushes them to the next level. One that encourages you to challenge the status quo. And one where you can impact the future for the greater good.
You'll do all this in a flexible environment that embraces connection and belonging. And with the recognition of several inclusivity and diversity awards, we've proven that Allstate empowers everyone to lead, drive change and give back where they work and live.
Good Hands. Greater Together.
Allstate generally does not sponsor individuals for employment-based visas for this position.
Effective July 1, 2014, under Indiana House Enrolled Act (HEA) 1242, it is against public policy of the State of Indiana and a discriminatory practice for an employer to discriminate against a prospective employee on the basis of status as a veteran by refusing to employ an applicant on the basis that they are a veteran of the armed forces of the United States, a member of the Indiana National Guard or a member of a reserve component.
For jobs in San Francisco, please click "here" for information regarding the San Francisco Fair Chance Ordinance.
For jobs in Los Angeles, please click "here" for information regarding the Los Angeles Fair Chance Initiative for Hiring Ordinance.
To view the "EEO is the Law" poster, click "here". This poster provides information concerning the laws and procedures for filing complaints of violations of the laws with the Office of Federal Contract Compliance Programs.
To view the FMLA poster, click "here". This poster summarizes the major provisions of the Family and Medical Leave Act (FMLA) and tells employees how to file a complaint.
It is the Company's policy to employ the best qualified individuals available for all jobs. Therefore, any discriminatory action taken on account of an employee's ancestry, age, color, disability, genetic information, gender, gender identity, gender expression, sexual and reproductive health decision, marital status, medical condition, military or veteran status, national origin, race (including traits historically associated with race, including, but not limited to, hair texture and protective hairstyles), religion (including religious dress), sex, or sexual orientation that adversely affects an employee's terms or conditions of employment is prohibited. This policy applies to all aspects of the employment relationship, including, but not limited to, hiring, training, salary administration, promotion, job assignment, benefits, discipline, and separation of employment.
Remote / Work From Home (WFH) / Virtual
1
Tools: AWS
No
Not Mandatory
20
12/12/2024
Stripe
Top / Startup
Top Company
Startup
BFSI / Fintech / NBFC
United States
United States
Incident Response Manager
Unknown
Incident Mgmt.
Scripting Skills
Pls See The Job Detail
Not disclosed
Other
Who we are
About Stripe
Stripe is a financial infrastructure platform for businesses. Millions of companies, from the world's largest enterprises to the most ambitious startups, use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone's reach while doing the most important work of your career.
About The Team
The Incident Ops team is a global 24/7 team responsible for driving incident response and management from detection to resolution. Stripe is proud of its five 9s reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRM) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team is skilled in communications, incident handling and technical adeptness as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you'll do
As an Incident Response Manager (IRM), you'll play the key role in driving the right level of response from Stripes to incidents, determining impact, rallying Stripes to mitigate, communicating to users, ensuring appropriate remediations and orchestrating the Root Cause Analysis (RCA) process. You'll work hand-in-hand with IRMs and engineers globally to ensure solid 24/7 coverage on how we monitor, detect, respond, communicate and mitigate incidents. When not managing incidents, you'll help scale our ability to respond to incidents, improve our operations, analyze data to provide insights and deepen our technical expertise in products. As a result, you'll be seen as the protector of our users - in minimizing the impact of incidents on their business and ensuring that Stripe is always thinking of our users.
Responsibilities
Act as an on-call Incident Commander, responsible for driving and managing incident resolution with a high level of urgency, cross-functional collaboration, and accuracy, while partnering with a global and diverse set of teams, including Engineering, Product, Policy, Risks, PR, Legal, Execs, etc.
Lead all user-facing incidents across domains at Stripe - including reliability, technical, security, and data privacy
"User First" approach to determine impact, providing accurate situation reports, facilitating comms bridges, and ensuring useful and timely external communications to users
Proactively update internal stakeholders, make decisions through data and influence by partnering with Engineering, Sales, Support and other cross-functional teams
Contribute to the root cause analysis process while conducting post-mortems, remediations identification, and ensure problem management tasks meet SLA and user expectations
Drive improvements in the incident handling process and incident management metrics and tooling based on trends and data of Stripe's incidents in collaboration with engineering, product and operations teams
Collaborate closely with leadership for building team strategy based on the team vision
Collaborate and coach other Incident Response Managers on the team
Who you are
We're looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum Requirements
5+ years of demonstrable major incident experience for organizations that run mission critical applications or always-on SaaS environments.
Demonstrated ability to lead multiple incidents concurrently with authority and influence responders with agency and reasoning skills to resolve ambiguous problems and drive to root cause.
Strong full stack technical skills with development/support experience with cloud based technologies
Demonstrated experience developing code and automation using Python, Ruby, JavaScript or shell scripting.
Solid understanding of infrastructure, including physical, virtual, and container-based compute platforms
Strong quantitative and analytical skills in data manipulation using SQL, Splunk or other tools.
Excellent task management skills, must be detail-oriented with ability to remain composed, methodical, and think fast in a high-pressured environment.
Exceptional written and verbal English communication skills, with the ability to translate complex technical issues for internal and external stakeholders
Preferred Qualifications
Domain expertise in classes of incidents such as technical, privacy, security or crisis with a strong desire to continuously learn about Stripe's products, technical issues and systems.
Ability to review complex technical details regarding ongoing issues/events and convey the key details to senior stakeholders to facilitate real-time decision making.
Experience with broad user-facing communications (e.g. status pages, tweets) and/or targeted communications (e.g. direct emails, support ticket responses).
Familiarity operating or managing distributed architectures with the ability to correlate system behaviors based on known inter-dependencies.
Demonstrated experience with full stack development and support
Working remotely at Stripe
A remote location, in most cases, is defined as being 35 miles (56 kilometers) or more from one of our offices. While you would be welcome to come into the office for team/business meetings, on-sites, meet-ups, and events, our expectation is you would regularly work from home rather than a Stripe office. Stripe does not cover the cost of relocating to a remote location. We encourage you to apply for roles that match the location where you currently or plan to live.
Pay and benefits
The annual US base salary range for this role is $180,200 - $270,300. For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range may be inclusive of several career levels at Stripe and will be narrowed during the interview process based on a number of factors, including the candidate's experience, qualifications, and location. Applicants interested in this role and who are not located in the US may request the annual salary range for their location during the interview process.
Additional benefits for this role may include: equity, company bonus or sales commissions/bonuses; 401(k) plan; medical, dental, and vision benefits; and wellness stipends.
1
Tools: Splunk
No
40
12/12/2024
ALTA IT Services, LLC
Job Consultancy
Job Consultancy
HR / Staffing
United States
United States
Major Incident Manager (night shift)
Unknown
Major Incident Manager
Azure
Pls See The Job Detail
Not disclosed
Other
ALTA IT Services is a wholly owned subsidiary of System One, a leading provider of specialized workforce solutions and integrated services. ALTA is an established leader in IT Staffing and Services, for both government and commercial enterprises across the United States, specializing in Program & Project Management, Application Development, Cybersecurity, Data & Advanced Analytics, and Agile Transformation Services.
- Contractors will be expected to work every other weekend and on some holidays
Job Description
Major Incident Management is responsible for driving the coordination and recovery efforts of major outages at the client. When issues impact the client's services or systems, major outages may occur, which result in serious interruptions to business and member activities. The Major Incident Management team operates 24x7 to ensure that impacted services are restored as efficiently and effectively as possible. The team actively monitors systems and services, documents and timelines recovery efforts, manages and coordinates various support team activities, and notifies business units of potential impacts and on-going recovery efforts. The team is also responsible for providing continual process improvement suggestions for the major incident management service, and monitoring for weekend change activities and military pay days.
Major Responsibilities
- Monitors Service Desk ticket queues, system alerts, and escalation methods to identify possible trends or outages
- Serves as the main point of contact for all incident and service issue escalations directed to the Major Incident Management team
- Ensures that incident management processes are efficiently and effectively followed
- Determines the impact and priority of incidents based on affected customers and/or business units
- Communicates operational issues to respective IT management, support teams, and incident communication managers
- Provides outage notification and recovery effort updates to business units via the Status Page
- Engages various support teams and resources to major incident bridges
- Manages and coordinates troubleshooting and recovery efforts between support teams and vendors
- Ensures continuous collaboration with IT Operations Management and other areas or teams
- Documents initial issues, recovery activities, and resolution steps taken via MIM timelines
- Ensures prompt resolution and coordination of incident management activities during recovery efforts
- Updates and validates outage information in availability management tools for reporting and tracking purposes
- Makes recommendations, proposals, and suggestions for improvement within the service to reduce severity and frequency of incidents
- Attends Post Incident Review Meetings or reviews meeting notes once the meetings conclude to ensure compliance with service improvement initiatives
- Attends and participates in TCABs (technical change advisory board meetings) to review, discuss, and approve or reject concerning upcoming changes or releases to the environment
- Coordinates, communicates, and manages Sunday Maintenance Windows for weekend scheduled activities
- Works with Problem Management and Change Management to resolve incidents
- Coordinates, communicates, and manages Military Pay Bridge activities
- Prepares operational status reports to IT Operations Management
- Updates and publishes Morning Reports
Required Qualifications
- Bachelor's Degree in a related field, or the equivalent combination of education, training, and/or experience
- Extensive IT experience that demonstrates knowledge of hardware and infrastructure protocols used to provide services to customers
- Extensive IT experience in at least one of the following areas: mainframe, networking, middleware (WebSphere), Azure
- Prior experience leading incident bridge calls from initial triage to guiding recovery efforts, maintaining a timeline and ensuring that service is restored as quickly as possible
- Experience in leading or supervising an IT team
- Demonstrated ability to lead others in a challenging and fast-paced large enterprise environment
- Strong research, analytical, and problem solving skills
- Strong planning, organizational, and multi-tasking skills
- Demonstrated ability in exercising initiative to produce desired results and achieve objectives
- Ability to effectively interface with various levels of employees, management, and vendors
- Excellent interpersonal, verbal, and written communication skills
- Practical incident management work experience
Desired Qualifications
- ITIL v3 or v4 Foundations Certificate
- CCNA / Networking Training and Certificates
- Middleware Training and Certificates
- Azure Training and Certificates
System One, and its divisions and subsidiaries including Joulé, ALTA IT Services, CM Access, and MOUNTAIN, LTD., are leaders in delivering workforce solutions and integrated services across North America. We help clients get work done more efficiently and economically, without compromising quality. System One not only serves as a valued partner for our clients, but we offer eligible full-time employees health and welfare benefits coverage options including medical, dental, vision, spending accounts, life insurance, voluntary plans, as well as participation in a 401(k) plan.
System One is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, age, national origin, disability, family care or medical leave status, genetic information, veteran status, marital status, or any other characteristic protected by applicable federal, state, or local law.
Remote / Work From Home (WFH) / Virtual
1
Tools: Azure, ITIL
No
54
12/12/2024
DexCare
Startup
Startup
IT / Software Dev
United States
United States
Manager, Release & Incident Management
Unknown
Release Manager
AWS
Pls See The Job Detail
Not disclosed
Other
Who is DexCare?
DexCare optimizes time in healthcare, streamlining patient access, reducing waits, and enhancing overall experiences. Born within Providence, DexCare addresses tech gaps by aligning supply and demand, recently securing a $75 million Series C funding round led by ICONIQ Growth, alongside partners like MGB, Kaiser Permanente Ventures, and others, to modernize healthcare infrastructures for an inclusive ecosystem.
What is DexCare?
DexCare, a digital care orchestration platform, streamlines care delivery logistics. It empowers healthcare systems to predict constraints, precisely schedule services, optimize capacity, and cut operational costs. Currently serving 57 million patients, including Kaiser Permanente and Providence, DexCare ushers in a new era of digital-care access, ensuring health systems can efficiently track and deliver every hour of capacity for consumer ease.
For more information, visit www.dexcare.com or follow us on LinkedIn.
We are looking for a strategic and detail-oriented Manager, Release & Incident Management to manage and oversee the entire release process from planning to deployment, ensuring quality and smooth execution across our ecosystem. This role requires someone with a strong understanding of development and operations, exceptional communication skills, and the ability to collaborate with multiple teams and stakeholders. You will also play a crucial role in driving process improvements and automation initiatives to enhance efficiency and scalability.
Role is based out of our Seattle HQ (Eastlake area) 2 days/week.
What You'll Do
End-to-End Release Management: Plan, coordinate, and manage the release process, ensuring quality and timely delivery
Stakeholder Communication: Communicate project commitments, changes, requirements, and progress to all stakeholders, including technical and non-technical teams
Risk Management: Identify and mitigate risks that may affect the release scope, quality, and schedule
Collaboration: Work closely with Product, Engineering, Infrastructure, and Customer Experience teams, as well as other IT departments to align release efforts
Release Calendar: Develop and maintain a centralized release calendar, providing visibility of all upcoming releases across the organization
Process Improvement: Lead initiatives to improve and streamline release management processes, focusing on reducing manual intervention and enhancing overall efficiency
Automation: Drive the automation of key release and deployment phases, ensuring faster, more reliable, and scalable processes
Documentation & Guidelines: Maintain detailed documentation on build and release processes, ensuring QA teams and all stakeholders are aligned with project guidelines
Release Tracking & Reporting: Measure and monitor the progress of releases, providing weekly updates on release activities and resolving any issues related to quality or scheduling
Feedback Integration: Collect and integrate feedback from teams and customers to enhance future releases
Dependency Management: Identify and address any dependencies or impacts on third-party applications, infrastructure updates, or defect backlogs that could alter release timelines
What You'll Bring
4+ years of experience managing software releases across an organization
Strong understanding of SDLC (Software Development Lifecycle), release management standards, and best practices
Experience with CI/CD systems, source control, build systems, automated testing, and software delivery systems
Proven track record of driving process improvements and implementing automation in release processes
Proficiency in project management, risk management, and change management
Excellent written and oral communication skills, with the ability to collaborate effectively across teams
Self-motivated and able to manage complex release processes independently
Experience with Atlassian, Jira and Confluence
Experience with Asana is preferred
DexCare is an Equal Opportunity Employer
All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations, and ordinances. DexCare does not exclude people or treat them differently because of race, color, national origin, age, disability, or sex. DexCare provides reasonable accommodation to all applicants who require such accommodation to apply for the position or to perform the essential functions of the job.
The salary range for this role is from $140,000-160,000 + Equity.
DexCare offers an outstanding benefits package, including:
- Eligible for Annual Bonus
- Healthcare benefits, short/long-term disability coverage, life insurance, and 401k
- Paid Parental Leave
- Eight paid holidays & Unlimited PTO
- Hybrid and remote working arrangements
Please note the national salary range listed in the job posting reflects the new hire salary range across levels and U.S. locations that would be applicable to the position. Final salary will be commensurate with the candidate's final level and final location. Also, this range represents base salary only and does not include bonus, equity, or benefits if applicable.
1
Tools: AWS, Confluence, Jira
No
86
12/12/2024
MHK TECH INC
Startup / Job Consultancy
Startup
Job Consultancy
HR / Staffing
Houston, TX
United States
IT Incident Manager
Unknown
Incident Mgmt.
Jira
Pls See The Job Detail
Not disclosed
Other
Job Summary
We are seeking an experienced Incident Manager to lead and coordinate responses to IT incidents, ensuring swift resolution and minimal disruption to business operations. The ideal candidate will have strong knowledge of Jira, experience with Root Cause Analysis (RCA), familiarity with data warehousing concepts, and the ability to work in US office hours.
Key Responsibilities
Lead and manage all aspects of the incident response process, including identification, prioritization, and escalation.
Coordinate with IT teams and stakeholders to ensure efficient resolution of incidents and timely communication.
Use Jira to track incidents, ensure accurate documentation, and analyze incident trends.
Conduct Root Cause Analysis (RCA) and post-incident reviews to identify underlying issues, recommend corrective actions, and drive continuous improvement.
Collaborate with teams to understand data warehousing concepts for better incident diagnosis and resolution.
Requirements
Proven experience in IT incident management or a similar role.
Proficiency in Jira, Root Cause Analysis (RCA), and familiarity with data warehousing concepts.
Strong communication, problem-solving, and leadership skills.
Ability to work in high-stress, fast-paced environments and during US office hours.
1
Tools: Jira
No
90
12/12/2024
TikTok
Top
Top Company
TV / Firm / Media / Entertainment
San Jose, CA
United States
Incident Response Manager - Data Center
Unknown
Incident Mgmt.
AWS
Pls See The Job Detail
Not disclosed
Other
Responsibilities
TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and its offices include New York, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.
Why Join Us
Creation is the core of TikTok's purpose. Our platform is built to help imaginations thrive. This is doubly true of the teams that make TikTok possible.
Together, we inspire creativity and bring joy - a mission we all believe in and aim towards achieving every day.
To us, every challenge, no matter how difficult, is an opportunity; to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always.
At TikTok, we create together and grow together. That's how we drive impact - for ourselves, our company, and the communities we serve.
Join us.
About The Team
The Data Systems Infrastructure (DSI) team sits within the ByteDance global technology structure and supports the company's fast growth by building and operating hyper-scale datacenters, managing the life cycle of the server fleet, providing cloud solutions, and developing various infrastructure services, making sure they are scalable and reliable.
The Incident Response Center (IRC) is the first layer of defense responsible for quick detection and incident response using various monitoring and automation tools, conducting thorough investigation of alerts, classification and triage. The Incident Response Manager is responsible for delivering operations within the IROC across all ByteDance datacenter sites in the respective regions. The IRC team is expected to respond to all alarms/alerts set in the Server Automation Operations System (SAOS) and Data Center Infrastructure Management (DCIM) to quickly discover anomalies and engage Subject Matter Expert (SME) teams to start issue triage. The IRC team provides business intelligence through rigorous analysis of alerts and issues, which reduces and prevents recurring incidents.
Responsibilities
Delivering global operations within the IROC (Incident Response Operation Center) ByteDance datacenter.
First responder and layer of defense responsible for quick detection and incident response using various monitoring and automation tools, conduct thorough investigation of alerts, classification and triage.
Respond to all infrastructure, facilities, security, and safety events notified via various means, such as alarms/alerts set in Server Operations and Maintenance, Datacenter Infrastructure Management, Network & Grafana, and other functions.
Respond to incidents and critical situations in a problem-solving manner, and conduct in-depth investigation of alerts.
Provide insights into the effectiveness of the incident response and recovery process through regular reports
Analyze trends and patterns in events to identify opportunities for improvement and optimization
Monitor the performance of incident response against the agreed-upon SLAs by alerting and notifying stakeholders
Escalation management: notify or initiate discussions with higher-level support teams engaged in resolution processes
Identify, assess and communicate potential risks arising through event monitoring that could affect customer's service
Support program managers and facilitate project deliverables, improve overall operational security and engineering initiatives
The Incident Response team is expected to work at a ByteDance datacenter site. This is an on-site role.
Qualifications
Minimum Qualifications
Knowledge of technical elements associated with systems such as Server Health, Datacenter Environment and IP Networks.
Outstanding verbal and written communication skills, the ability to work with minimal direction and meet goals, attention to detail, and an eye for continuous improvement.
Preferred Qualifications
Degree in Information Technology.
5 years experience in service center, or similar 24x7 operations center environment.
3 years of experience in a technology company or experience as a team lead, and experience in operation program management.
5 years experience as an incident and problem manager.
Good data analytics and presentation skills.
Ability to successfully interact at all levels of the organization, including with clients, while functioning as a team player.
Basic working knowledge of data protection policies such as GDPR and the need to keep sensitive information secure.
Working knowledge and/or certifications in ITIL, CompTIA Server+, Schneider Electric Data Center Certified Associate (DCCA), Data Analytics and Visualization.
Willingness to be on call including weekends, nights, and holidays.
Works well under pressure and within time constraints to solve problems and complete deliverables.
TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
TikTok is committed to providing reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs or other reasons protected by applicable laws. If you need assistance or a reasonable accommodation, please reach out to us at https://shorturl.at/cdpT2
The base salary range for this position in the selected city is $109,600 - $218,400 annually.
Compensation may vary outside of this range depending on a number of factors, including a candidate's qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.
Benefits may vary depending on the nature of employment and the country work location. Employees have day one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short-term and long-term disability coverage, life insurance, wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year and 17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure).
The Company reserves the right to modify or change these benefits programs at any time, with or without notice.
For Los Angeles County (unincorporated) Candidates:
Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:
Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;
Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and
Exercising sound judgment.
1
Tools: AWS, ITIL
No
Yes
723
12/12/2024
Smarsh
Top / Startup
Top Company
Startup
BFSI / Fintech / NBFC
United States
United States
Director of IT Incident and Problem Management
President/ Head/ Director
Problem Manager
Azure
Pls See The Job Detail
Other
*ALSO OPEN TO UK-BASED CANDIDATES*
The Director of IT Incident and Problem Management is a senior leader responsible for shaping and transforming incident and problem management to a predictive and proactive discipline. You will drive a proactive, agile approach to incident response and problem management, building and leveraging AI-driven insights to enhance responsiveness and operational efficiency. Your leadership will underpin our pivot from a product to a platform-focused service, ensuring seamless, resilient service delivery that meets our high standards for reliability and customer satisfaction.
As a forward-thinking leader, you will balance traditional ITIL frameworks with modern tools and practices, such as incident.io and FireHydrant, and embed accountability across engineering and operational teams. You will work closely with cross-functional stakeholders including Engineering, Product, and Customer Support to ensure that incidents are resolved promptly and root causes are addressed comprehensively, with the overarching goal of minimizing business impact.
Qualifications
Strategic Incident and Problem Management Expertise: 10-15 years of experience in IT incident and problem management, ideally within SaaS and platform-based environments, with a minimum of 5 years in a senior leadership capacity.
Modern Practices in Incident Management: Demonstrated expertise in using cutting-edge incident management tools (e.g., incident.io, FireHydrant) and AI-driven solutions to streamline processes, drive rapid response, and enhance service reliability.
Problem Management: Expertise in leading comprehensive root cause analysis and problem resolution efforts, incorporating Google SRE principles for preventive actions.
Google SRE Methodologies: In-depth knowledge of Google SRE philosophies, including error budget management, service level indicators/objectives (SLIs/SLOs), and effective incident response strategies.
Platform and SaaS Experience: Strong understanding of platform-oriented operations within B2B SaaS, ideally with experience in supporting a pivot from product to platform. FinTech experience is advantageous but not required.
Leadership and Accountability: Proven record of building and leading high-performing teams, with an emphasis on holding teams accountable to clear standards and ensuring consistency in incident response and resolution.
Collaborative Communication Skills: Excellent ability to influence and collaborate with cross-functional teams and executive-level stakeholders. Skilled in delivering complex insights to both technical and non-technical audiences.
Innovation and Continuous Improvement: Ability to drive continuous improvement through innovative practices, data insights, and strategic thinking. An advocate for evolving incident/problem management to proactively support business goals.
Cross-cloud environments: Experience managing incident and problem resolution in cross-cloud environments, ideally with a focus on seamless integration of diverse platforms.
Plusses
Bachelor’s degree in Computer Science, Information Systems, or a related field; a Master’s degree is preferred.
ITIL Expert certification and familiarity with Google SRE principles; advanced certifications in cloud platforms (AWS, GCP, Azure) or incident management tools are highly advantageous.
Familiarity with leveraging AI and machine learning within incident and problem management to predict incidents, automate responses, or identify root causes, showcasing an ability to bring innovative solutions to the role.
1
Tools: AWS, Azure, ITIL
Not Mandatory
724
12/12/2024
Reqroute, Inc
Startup / Job Consultancy
Startup
Job Consultancy
HR / Staffing
United States
United States
Major Incident Manager
Unknown
Major Incident Manager
ITIL
Pls See The Job Detail
Other
Role: Major Incident Manager (MIM)
Pay Rate $47/hr W2
Location: Remote
What are the top skills required for this role?
1. Major Incident Management
2. ServiceNow
3. Problem Management
4. MS Excel (Intermediate)
Job Description/ Responsibilities
• Deep understanding of ITSM and ITIL processes and frameworks
• Act as a single point of contact from infrastructure for major incidents
• Facilitate communication among teams, including technical, primary stakeholders and leadership groups
• Mobilize appropriate infrastructure resources/teams to diagnose and resolve the incident
• Coordinate efforts across multiple teams and support groups for faster restoration of services
• Ensure timely updates to key stakeholders, including status, impact and progress
• Oversee the technical troubleshooting process and drive towards faster recovery of services within agreed SLA timelines
• Ensure workarounds are implemented when necessary to minimize business disruptions
• Create post-incident reports (PIRs) to analyze root causes, impact, corrective actions (short and long-term) to aid in Problem Management
• Prepare, analyze and track metrics, and provide insights to leadership, to identify patterns or systemic issues
• Manage customer or client expectations during incident triages
• Excellent written, verbal communication and interpersonal skills
• Collaborate with Technology infrastructure teams to implement proactive measures to prevent issues
• Act as the face of the infrastructure during major incident triages
1
Tools: ITIL, MS Excel
777
12/12/2024
Harrison National Employment
Job Consultancy
Job Consultancy
HR / Staffing
United States
United States
Major Incident Manager
Unknown
Major Incident Manager
Monitoring Tools
Pls See The Job Detail
Other
Summary:
The Major Incident Manager role will report to the Vice President of Technology Operations. The Major Incident Management team handles coordination and communication for all service-impacting outages with the aim of minimizing their impact on downstream users and resolving them swiftly. This involves organizing and leading investigations in collaboration with technical teams throughout the organization, ranging from developers to customer support. Acting as the primary contact for all partners, the team ensures accurate and timely information dissemination to all stakeholders during significant outages. The team's contribution is crucial in delivering top-notch service to clients, and they take pride in their influence on the company's ongoing success. This role will also play a key role in Problem Management activities, focusing on root cause analysis and implementing remediation actions to prevent recurring issues.
Essential Job Duties and Responsibilities:
Works on moderate to complex issues where analysis of situations or data requires an in-depth knowledge of the application function, business units, or company.
Facilitate and drive the expedient resolution of an incident with leadership, control, and effective communication across multiple teams to limit overall customer impact.
Quickly assess the severity of an outage and be able to report business impact and technical complexity.
Maintain work stream documentation and data integrity within the incident.
Notify, escalate, and communicate to management the existence and status of outages, as needed.
Identify areas of improvement during an incident that may need additional corrective action via problem records or follow ups.
Support and drive automation, elimination, and simplification of our current processes.
Participate in an on-call production support schedule.
Comply with all company policies and procedures.
Maintain regular and punctual attendance.
Other Job Duties and Responsibilities:
Performs other related duties as assigned.
Supervisory Responsibilities:
This position does not have direct reports
Qualifications:
To perform this job successfully, an individual must be able to perform each essential function satisfactorily. The requirements listed below are representative of the knowledge, skill, and/or ability required.
Strong written and verbal communication skills with the ability to effectively disseminate information across the organization, including senior leadership.
Experience with application or infrastructure services.
Excellent collaboration skills with an interest in developing a strong network.
Ability and aptitude to pick up new technologies or procedures.
Strong organization and decision-making skills with an emphasis on multitasking and the ability to execute during major outages.
Ability to translate situations in both a technical and business-friendly manner.
Ability to control and drive an incident call with several participants.
Innovation and drive - someone who is happy to challenge the status quo and collaborate on process improvements.
A self-starter who is assertive but diplomatic with the ability to prioritize workload and to perform in a high-pressure environment.
Basic understanding of industry standard technologies and infrastructure such as: networking, cloud storage, middleware, databases and virtual infrastructure.
ITIL Service Support knowledge.
Proficiency in MS Office Tools (Excel, PowerPoint, Teams, Power BI).
Education and/or Experience:
5+ years Operational Experience in an IT support role.
3+ years Incident Management/Coordination role.
2+ years experience with ITIL framework and production support best practices.
Experience with ITSM tools such as ServiceNow.
Experience with Enterprise Communication and Notification tools.
Working knowledge of Monitoring and Alerting tools (Splunk, Dynatrace, etc.).