Wednesday, 16 December 2015

adop apply phase validation errors when abandoning a node


We have 4 nodes:

CM (also the Admin server), DB, MT and DMZ.

Due to some (unknown) issues, the prepare phase failed on the DMZ node.
We decided to abandon the DMZ node, apply the patches on the remaining nodes (MT, CM) and then attach the DMZ node again.

So we started applying the patches using:
adop phase=apply patches=21613746,20979690,20805526,21127389,21164357,21035745,19390498,21902736 merge=yes patchtop=/staging/patches/ flags=autoskip

When applying patches, adop checks whether prepare has completed on all nodes.
It saw that prepare had failed on the DMZ node and asked whether we wanted to continue applying on the other nodes; we responded Y.
As part of validation, adop then checks whether the services enabled on the DMZ node (the abandoned node) will still be served by the MT and CM nodes once the DMZ node is abandoned.
(For some reason, s_formsstatus was enabled on our DMZ node, which it is not supposed to be, and it is disabled on both CM and MT. We use socket mode, so formsstatus need not be enabled.)
adop sees that s_formsstatus is enabled on DMZ but disabled on both MT and CM, so it concludes that the application cannot serve some services if DMZ is down.
So the validation was failing.


Here is where it checks:



/opt01/app/oracle/ERMR/fs_ne/EBSapps/log/adop/20/apply_20151029_221100/ERMR_ios11202e/TXK_EVAL_apply_Thu_Oct_29_22_23_14_2015

tail -f txkADOPEvalSrvStatus_Thu_Oct_29_22_23_14_2015.log

..
..
=============================
Inside evaluateSrvStatus()...
=============================

+ + + ACTIVE_NODE_GROUP_SERVICE_STATUS = s_batch_status,s_other_service_group_status,s_root_status,s_web_admin_status,s_web_applications_status,s_web_entry_status + + +

+ + + EBS_GROUP_SERVICE_STATUS         = s_batch_status,s_other_service_group_status,s_root_status,s_web_admin_status,s_web_applications_status,s_web_entry_status + + +

+ + + ACTIVE_NODE_SERVICE_STATUS = s_adminserverstatus,s_apcstatus,s_concstatus,s_forms-c4wsstatus,s_formsserver_status,s_jtffsstatus,s_nodemanagerstatus,s_oacorestatus,s_oaeastatus,s_oafmstatus,s_opmnstatus,s_tnsstatus + + +

+ + + EBS_SERVICE_STATUS         = s_adminserverstatus,s_apcstatus,s_concstatus,s_forms-c4wsstatus,s_formsserver_status,s_formsstatus,s_jtffsstatus,s_nodemanagerstatus,s_oacorestatus,s_oaeastatus,s_oafmstatus,s_opmnstatus,s_tnsstatus + + +


...

If you look at the output above, ACTIVE_NODE_SERVICE_STATUS (services on the MT and CM nodes) and EBS_SERVICE_STATUS (services on MT, CM and DMZ)
should be equal; otherwise adop will quit. In this case they differ (s_formsstatus appears only in EBS_SERVICE_STATUS), and so it was failing.
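
The comparison can be reproduced by hand from the two log lines. This is only a minimal sketch, not adop's actual code: missing_services is a hypothetical helper that prints every service in the second comma-separated list that is absent from the first.

```shell
# Hypothetical helper (not part of adop): print the services that appear in
# the EBS-wide list ($2) but not in the active-node list ($1).
missing_services() {
    active=",$1,"
    printf '%s\n' "$2" | tr ',' '\n' | while IFS= read -r svc; do
        [ -n "$svc" ] || continue
        case "$active" in
            *",$svc,"*) ;;               # also enabled on a remaining node
            *) printf '%s\n' "$svc" ;;   # enabled only on the abandoned node
        esac
    done
}

# Values copied from the txkADOPEvalSrvStatus log above
ACTIVE="s_adminserverstatus,s_apcstatus,s_concstatus,s_forms-c4wsstatus,s_formsserver_status,s_jtffsstatus,s_nodemanagerstatus,s_oacorestatus,s_oaeastatus,s_oafmstatus,s_opmnstatus,s_tnsstatus"
EBS="s_adminserverstatus,s_apcstatus,s_concstatus,s_forms-c4wsstatus,s_formsserver_status,s_formsstatus,s_jtffsstatus,s_nodemanagerstatus,s_oacorestatus,s_oaeastatus,s_oafmstatus,s_opmnstatus,s_tnsstatus"
missing_services "$ACTIVE" "$EBS"
```

With the values from our log this prints s_formsstatus, the one service that only the DMZ node had enabled.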

So, how does it compare them?

During validation, adop downloads the patch file system context files from all nodes and stores them under:
/opt01/app/oracle/ERMR/fs_ne/EBSapps/log/adop/20/apply_20151029_221100/ERMR_ios11202e/TXK_EVAL_apply_Thu_Oct_29_22_23_14_2015/ctx_files
It then greps the enabled services from these context files.
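
To see what each node contributes, the downloaded context files can be inspected directly. This is a rough sketch and not what adop itself runs: enabled_services is a hypothetical helper, and it assumes each service status sits on a single line in the form oa_var="s_...status">enabled<, which should be checked against your own context files.

```shell
# Hypothetical helper: list the service-status variables marked "enabled"
# in every context file under a directory.
# ASSUMPTION: each status is on one line as  oa_var="s_...status">enabled<
enabled_services() {
    ctx_dir=$1
    for f in "$ctx_dir"/*.xml; do
        [ -f "$f" ] || continue
        printf '%s:\n' "$f"
        sed -n 's/.*oa_var="\(s_[^"]*status\)"[^>]*>enabled<.*/  \1/p' "$f" | sort
    done
}

# Example: enabled_services /path/to/TXK_EVAL_apply_.../ctx_files
```

A node whose list contains a service that no other node has (s_formsstatus on DMZ in our case) is the one that will trip the validation.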

(adop runs the below command to download context files :
/opt01/app/oracle/ERMR/fs1/EBSapps/comn/util/jdk32/bin/java -classpath :/opt01/app/oracle/ERMR/fs1/EBSapps/comn/shared-libs/ebs-3rdparty/WEB-INF/lib/ebs3rdpartyManifest.jar:/opt01/app/oracle/ERMR/fs1/FMW_Home/oracle_common/modules/oracle.jdbc_11.1.1/ojdbc6dms.jar:/opt01/app/oracle/ERMR/fs1/FMW_Home/oracle_common/modules/oracle.xdk_11.1.0/xmlparserv2.jar:/opt01/app/oracle/ERMR/fs1/FMW_Home/oracle_common/modules/oracle.odl_11.1.1/ojdl.jar:/opt01/app/oracle/ERMR/fs1/FMW_Home/oracle_common/modules/oracle.dms_11.1.1/dms.jar:/opt01/app/oracle/ERMR/fs1/EBSapps/comn/java/classes oracle.apps.ad.autoconfig.oam.CtxSynchronizer action=downloadall downloaddir=/opt01/app/oracle/ERMR/fs_ne/EBSapps/log/adop/20/apply_20151029_221100/ERMR_ios11202e/TXK_EVAL_apply_Thu_Oct_29_22_23_14_2015/ctx_files fileeditiontype=patch contextfile=/opt01/app/oracle/ERMR/fs2/inst/apps/ERMR_ios11202e/appl/admin/ERMR_ios11202e.xml promptmsg=hide logfile=/opt01/app/oracle/ERMR/fs_ne/EBSapps/log/adop/20/apply_20151029_221100/ERMR_ios11202e/TXK_EVAL_apply_Thu_Oct_29_22_23_14_2015/CtxSynchronizer.log )


So, how to fix this:

Disable forms on the DMZ node that is to be abandoned by updating its context file.
As it is the patch file system context file, update that file and then upload it to the DB using the command below:

$AD_TOP/bin/adconfig.sh contextfile=/opt01/app/oracle/fs2_context_file.xml -syncctx

This will update the DB with the context file from fs2.

We then ran adop again, after which it abandoned the DMZ node and went on to apply the patches on the MT and CM nodes.

Saturday, 26 September 2015

E-Business Suite 12.2 adop issues on AIX 6.1 on an NFS mount

We did a new 12.2 installation and applied the 12.2.4 patch.
After the 12.2.4 patch was applied, and before applying "Section 8: Apply Additional Critical Patches", we ran adop phase=prepare.

It errored with this message:
=============================
Could not find the pattern...
=============================

Executing SYSTEM command: perl /opt02/app/oracle/bisapp/fs1/EBSapps/comn/adopclone_server1/bin/adclone.pl java=/opt02/app/oracle/bisapp/fs1/EBSapps/comn/adopclone_server1/FMW/t2pjdk mode=fscloneapply stage=/opt02/app/oracle/bisapp/fs1/EBSapps/comn/adopclone_server1 component=ohsConfig appctx=/opt02/app/oracle/bisapp/fs1/inst/apps/qa_server1/appl/admin/qa_server1.xml appctxtg=/opt02/app/oracle/bisapp/fs2/inst/apps/RBISQA_ios1901e/appl/admin/qa_server1.xml
EXIT STATUS: 1

======================================
Inside copyCloneLogsToFSNE()...
======================================

Creating the directory: /opt02/app/oracle/bisapp/fs_ne/EBSapps/log/adop/5/prepare_20150226_065021/qa_server1/TXK_SYNC_migrate_Thu_Feb_26_07_23_03_2015/ohsConfig_apply

Copying the directory
---------------------
SOURCE : /opt02/app/oracle/bisapp/fs1/inst/apps/qa_server1/admin/log/clone
TARGET : /opt02/app/oracle/bisapp/fs_ne/EBSapps/log/adop/5/prepare_20150226_065021/qa_server1/TXK_SYNC_migrate_Thu_Feb_26_07_23_03_2015/ohsConfig_apply

/opt02/app/oracle/bisapp/fs1/EBSapps/comn/adopclone_server1/bin/adclone.pl did not go through successfully.
LOG DIRECTORY: /opt02/app/oracle/bisapp/fs_ne/EBSapps/log/adop/5/prepare_20150226_065021/qa_server1/TXK_SYNC_migrate_Thu_Feb_26_07_23_03_2015/ohsConfig_apply.
*******FATAL ERROR*******
PROGRAM : (/opt02/app/oracle/bisapp/fs1/EBSapps/appl/ad/12.0.0/patch/115/bin/txkADOPPreparePhaseSynchronize.pl)
TIME : Thu Feb 26 08:02:04 2015
FUNCTION: main::migrateCloneComponentApply [ Level 1 ]


========================================================
The issue was that the prepare phase creates several temp files and deletes them after its jobs are done. Along the way, hidden .nfs*** files appeared on the NFS mount, and when the adop process tried to remove them it was not able to. So when adop later recreated the directories, the files/directories were already there, the creation failed, and adop exited.
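
A quick way to confirm this on a failed run is to look for the leftover hidden files under the clone stage. This is a minimal sketch; find_nfs_leftovers is a hypothetical helper name, and the example path is the t2pjdk directory taken from the java= argument in the log above.

```shell
# Hypothetical helper: list hidden .nfs* leftovers under a directory tree.
find_nfs_leftovers() {
    find "$1" -name '.nfs*' -print
}

# Example:
# find_nfs_leftovers /opt02/app/oracle/bisapp/fs1/EBSapps/comn/adopclone_server1/FMW/t2pjdk
```

If this prints anything after adop has tried to clean up, the next attempt to recreate the directory is likely to fail in the same way.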

We used several workarounds to avoid this issue. They worked most of the time, but not 100% of the time.


Workaround(s) used:

1) Continuous monitoring, copying bin and jre to /FMW/t2pjdk when required
2) When it fails, delete or move t2pjdk and kick off the adop cycle again
3) txkADOPPreparePhaseSynchronize.pl updated with the value below

$java_loc = $stage_loc . TXK::OSD->trDirPathToBase('/FMW/t2pjdk/jdk64');

These are the final findings:
- The .nfs files are created by NFS when the rm command is triggered on files that are still loaded as shared libraries; the adop process cannot delete them.
- This is already handled for 32-bit programs by the following env variable setting (a prerequisite for the 12.2 installation):
LDR_CNTRL=PRIVSEG_LOADS@MAXDATA=0xB0000000@DSA
- However, the LDR_CNTRL env variable works only for 32-bit programs and will not work for 64-bit programs. A fix for that has to be provided by IBM.

Unfortunately, the NFS team states: "There is no such NFS feature to unload shared library at user level".
So, what is the solution?

There is no solution yet, but there is a workaround:
slibclean

We run slibclean constantly during the prepare phase, i.e. every 5 seconds, from a cron job.
Script used:
==
# Unload unused shared libraries every 5 seconds until interrupted
while true; do
sudo -u root /etc/slibclean
sleep 5
done
==

This resolved the issue for now.
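
The loop above runs until someone kills it. A variant that stops by itself once the adop session ends could look like the sketch below. It is untested on AIX; run_while_alive and ADOP_PID are hypothetical names, and it assumes the script runs as the same OS user as adop so that kill -0 is allowed to probe the process.

```shell
# Hypothetical helper: run a command every INTERVAL seconds for as long as
# the process with the given PID exists.
# ASSUMPTION: the caller may signal that PID (same user), otherwise kill -0
# fails and the loop ends immediately.
run_while_alive() {
    pid=$1
    interval=$2
    shift 2
    while kill -0 "$pid" 2>/dev/null; do
        "$@"                 # the cleanup command, e.g. slibclean
        sleep "$interval"
    done
}

# Example (ADOP_PID is a placeholder for the PID of the adop session):
# run_while_alive "$ADOP_PID" 5 sudo -u root /etc/slibclean
```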

We don't have that issue now after the workaround.
Oracle and IBM are not yet ready to provide a solution for this...




We also had multiple other issues with adop on AIX (e.g. during fs_clone).
Note: These issues are specific to AIX on an NFS mount (GPFS file system).






   