Building potential patent portfolios:
An integrated approach based on topic
identification and correlation analysis
Xian ZHANG1,2, Haiyun XU1, Shu FANG1, Zhengyin HU1,2 & Shuying LI2
1Chengdu Library, Chinese Academy of Sciences, Chengdu 610041, China
2University of Chinese Academy of Sciences, Beijing 100190, China
Pages 39-51. Received: Apr. 13, 2015
Revised: Jul. 27, 2015
Accepted: Jul. 29, 2015
This work is jointly supported by the Science and Technology Service Network Initiative of Chinese
Academy of Sciences (Grant No.: KFJ-EW-STS-032), the West Light Foundation of Chinese Academy
of Sciences (Grant No.: Y4C0091001), and the National Social Science Foundation of China (Grant
No.: 14CTQ033). This paper is based on a presentation at the 4th Global TechMining
Conference, Stadsgehoorzaal, Leiden, the Netherlands.
X. Zhang (email@example.com, corresponding author) was responsible for the overall research
design, proposed the analysis framework, wrote the paper outline and revised the paper. H.Y. Xu
(firstname.lastname@example.org) designed the patent analysis framework, performed data analysis, made the figures and tables and revised the paper. S. Fang (email@example.com) proposed the research topic
and revised the paper. Z.Y. Hu (firstname.lastname@example.org) completed the details of patent analysis
framework and S.Y. Li (email@example.com) searched information for related study and edited
"Purpose: This paper suggests a framework to identify important patents for building potential patent portfolios based on patents owned by different assignees so as to highlight the value of individual patents in technology transfer and identify potential collaborators for patent assignees.
Design/methodology/approach: The analysis framework includes the following steps: 1) co-classification analysis based on the International Patent Classification (IPC) codes and Derwent Manual Codes (DMC) to detect sub-tech fields, 2) keyword co-occurrence analysis aiming to understand the core technology information in each patent, and 3) social network analysis used for identifying important technologies and partnerships of key assignees. A case study was conducted with 27,401 chemistry patents filed by a Chinese national research institute.
Findings: The results show that this framework is effective in building potential technological patent portfolios based on patents owned by different assignees and identifying future collaborators for the assignees. This integrated approach based on topic identification and correlation analysis that combines network-based analysis with keyword-based analysis can reveal important patented technologies and their connections and help understand detailed technological information mentioned in patents.
Research limitations: In the keyword analysis, only titles and abstracts of patent documents were used, and weights of keywords in different parts of the documents were not considered.
Practical implications: The analysis framework provides valuable information for decision-makers of large institutions which have many patents with broad application prospects.
Originality/value: Different from previous patent portfolio studies based on the use of a combination of patent analysis indicators, this study provides insights into a method of building patent portfolios to discover the potential of individual patents in technology transfer and promote cooperation among different patent assignees."
1 Introduction
Companies apply for patents to obtain the exclusive right to use, sell and offer to
sell the patented technologies for a limited period of time. Brockhoff introduced
the concept of patent portfolio, which is the collection of patents owned by an
individual or a company. In order to enhance global competitiveness, companies
have stepped up their efforts in patent management through patent portfolio analysis.
For large enterprises or organizations which have a lot of subsidiary companies,
however, it is a challenging task to build potential patent portfolios based on patents
owned by different assignees and promote collaboration between these firms.
Current studies on patent portfolios have focused primarily on analysis related to
one patent holder, and few have considered ways of building potential portfolios of
patents held by different assignees. Using a case study of a large research institute
in China, this paper intends to suggest an analysis framework for enterprises and
organizations to build potential patent portfolios based on patents of different
assignees so as to reveal the value of individual patents in technology transfer and
identify potential collaborators for these assignees.
2 Related study
One of the major approaches to identification of important patents is co-occurrence
analysis based on structured information such as patent classification codes,
assignees, or citations. Researchers tried to use the co-occurrence relationships of
patent classification codes to conduct co-classification analysis. As with co-citation and
co-word analysis, it is assumed that classification codes represent cognitive elements
associated with topics, specialities, or fields, and that analysis of the co-occurrence of
patent classification codes can measure technological distance, with patents in a
given patent category being considered more similar to one another than to those in
other patent categories[4,5]. However, one limitation of co-classification analysis is that the classification may be too broad to fully capture all the technologies within
a category or to meet the goal of a particular analysis.
Patent citations refer to the count of citations of a patent in subsequent patents,
and citations of a patent can be used to measure the relative importance of the
patent. Many studies[7,8,9,10,11,12,13] used citation counts to evaluate the quality and impact
of the cited technologies and the relationships between cited and citing technological
areas. However, the patent citation approach may lead to inaccurate results, as
applicants and examiners may have different purposes and motivations when citing
patents, and thus citations may not reflect the association between technologies.
With the development of data mining techniques, analyzing the unstructured data
in patent documents such as titles and abstracts has become possible. Yoon & Park
extracted keywords through text mining and performed keyword-based morphology
analysis. Tseng et al.
discussed text mining techniques such as text segmentation,
term association, and topic mapping technique and their empirical study found that
patent abstract and general summary and summaries from each section in a patent
document are the most topic-relevant sections.
Social network analysis (SNA) is applied in identification of core technologies
and detection of technological association. For instance, Yoon & Park measured
technology distance with the Euclidean distance of keyword vectors and built
patent association networks. Luan
built the evolution network of solar energy
technologies and identified key technologies with betweenness centrality. However,
relatively little research has been conducted in using SNA approach to build patent
portfolios and find collaborators for different patent assignees.
The suggested analysis framework, illustrated in Fig. 1, involves three modules:
• Data preparation. This module includes selection of a patent technology
field and appropriate databases for data retrieval. Keywords are first extracted
from patent titles and abstracts with Thomson Data Analyzer (TDA)①.
Standardization work is then conducted for the extracted keywords, which
involves using the singular form, unifying synonyms, avoiding hyphens and abbreviations, and standardizing phrases. For example, “fuel battery”
and “fuel cell” are standardized into “fuel battery”.
① Thomson Data Analyzer (TDA) can be used for patent data mining and visualization. Details are available at http://thomsonreuters.com/en/products-services/intellectual-property/patent-research-and-analysis/thomson-data-analyzer.html.
• Network construction and visualization.
We first construct a technology
network based on co-occurrence of the IPC codes and DMCs, respectively, and
then a keyword co-occurrence network based on keywords extracted from
patent titles and abstracts. The network-based patent analysis gives us the
overall situation of a relevant technical field, including important patents and
the relationships between patents from a viewpoint of the network, while the
keyword-based patent analysis enables us to understand the core technology
information within patents on the basis of analysis of patent content. To identify
the unique characteristics of the patent technology network and keyword
network, SNA indicators such as the E-I index and centrality measures are
employed. The Girvan-Newman algorithm is employed for clustering, and a software
tool (version 6.216) is used for the analysis and visualization of data.
• Data interpretation. With the participation of technological experts, the
results are interpreted so as to build possible patent portfolios and find potential
cooperation opportunities for patent assignees.
Fig. 1. The analysis framework.
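The keyword standardization step in the data preparation module can be sketched in a few lines. This is only an illustration under stated assumptions: the synonym table, the crude plural-stripping rule and the function names are hypothetical and are not taken from TDA or from the authors' workflow; only the "fuel cell" to "fuel battery" mapping comes from the text.

```python
import re

# Hypothetical synonym table; only this one mapping is given in the paper.
SYNONYMS = {"fuel cell": "fuel battery"}


def singularize(word: str) -> str:
    """Very rough plural stripping (assumption); a real pipeline would use a lemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def standardize(keyword: str) -> str:
    """Lower-case, drop hyphens, singularize the head noun, then unify synonyms."""
    text = keyword.lower().replace("-", " ")
    text = re.sub(r"\s+", " ", text).strip()
    words = text.split(" ")
    words[-1] = singularize(words[-1])  # only the last word of the phrase
    text = " ".join(words)
    return SYNONYMS.get(text, text)


raw_keywords = ["Fuel-Cells", "fuel battery", "Polymers"]
standardized = [standardize(k) for k in raw_keywords]
```

Abbreviation expansion is left out here because it would need a domain-specific dictionary.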
3.2 Key steps of analysis process
3.2.1 Construction of patent technology network
A patent may be assigned more than one classification code, which
means that the patent is involved in several technology fields. We used the TDA tool
(version 5.1) to build co-occurrence matrices of IPC codes and DMCs, respectively. We use both IPC codes and DMCs because the IPC is function-oriented
rather than application-oriented, which means that IPC codes can hardly be
matched with the industry technology classifications used in practice. By comparison,
Derwent manual codes provided by the Derwent Innovations Index (DII) are application-oriented.
As a result, using both IPC codes and DMCs to build the technology
network gives a comprehensive view of technology fields or areas.
Following Tseng et al.’s study, the similarity between patent J and patent K
can be measured by calculating the cosine of the angle between their vectors, following Eq. (1):
$$\mathrm{Sim}(J,K) = \frac{\sum_{i=1}^{n} IPC_{ij}\, IPC_{ik}}{\sqrt{\sum_{i=1}^{n} IPC_{ij}^{2}}\;\sqrt{\sum_{i=1}^{n} IPC_{ik}^{2}}} \tag{1}$$
where each patent is regarded as a vector over the collection of n IPC
codes, IPCij denotes the occurrence frequency of the ith IPC code in patent J, and
IPCik denotes the occurrence frequency of the ith IPC code in patent K.
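The cosine measure described above can be sketched as follows. The function name and the two example code lists are illustrative assumptions, not data from the case study; each patent is represented by the frequencies of the classification codes assigned to it.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(codes_j, codes_k):
    """Cosine of the angle between two code-frequency vectors."""
    vec_j, vec_k = Counter(codes_j), Counter(codes_k)
    # Only codes present in both patents contribute to the dot product.
    dot = sum(vec_j[c] * vec_k[c] for c in vec_j.keys() & vec_k.keys())
    norm_j = sqrt(sum(v * v for v in vec_j.values()))
    norm_k = sqrt(sum(v * v for v in vec_k.values()))
    if norm_j == 0 or norm_k == 0:
        return 0.0  # a patent without codes is treated as unrelated (assumption)
    return dot / (norm_j * norm_k)


# Hypothetical patents with illustrative IPC subclasses.
patent_j = ["C08L", "C08K", "C09D"]
patent_k = ["C08L", "C09D", "B01J"]
similarity = cosine_similarity(patent_j, patent_k)  # 2 shared codes out of 3 each
```

The same function applies unchanged to Derwent manual codes, since it only relies on code frequencies.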
3.2.2 Construction of keyword co-occurrence network
According to linguistics, nouns and noun phrases express the main information of a sentence,
and the title and abstract of a patent document contain the main
information of the patented technology. Thus, keywords which are nouns and
noun phrases are extracted from patent documents and are used to construct the
keyword co-occurrence network.
Each patent document is transformed into a vector by the frequency of the
keywords’ occurrence. Words with co-occurrence frequency over 100 times are
considered to be closely connected with a specific technology and adopted as
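keywords. The association strength between two nodes is calculated by applying the
cosine measure following Eq. (2):
$$\mathrm{Sim}(J,K) = \frac{\sum_{i=1}^{n} Term_{ij}\, Term_{ik}}{\sqrt{\sum_{i=1}^{n} Term_{ij}^{2}}\;\sqrt{\sum_{i=1}^{n} Term_{ik}^{2}}} \tag{2}$$
where n is the number of keywords, Termij denotes the occurrence frequency of the ith keyword in patent J,
and Termik denotes the occurrence frequency of the ith keyword in patent K.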
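A minimal sketch of how co-occurrence edges between keywords could be counted is given below. It assumes that the frequency threshold is applied to the number of patents in which a keyword appears; the paper uses a threshold of 100, while the toy data here use 2. The function name and the sample patents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(patent_keywords, min_frequency):
    """Count keyword pairs appearing in the same patent, keeping frequent keywords only."""
    # Number of patents in which each keyword occurs.
    frequency = Counter(k for kws in patent_keywords for k in set(kws))
    kept = {k for k, n in frequency.items() if n >= min_frequency}
    edges = Counter()
    for kws in patent_keywords:
        present = sorted(set(kws) & kept)  # sorted so each pair has one canonical order
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


# Toy data, not taken from the case study.
patents = [
    ["phenolic resin", "acetone", "polymerization"],
    ["phenolic resin", "acetone"],
    ["chitosan", "polymerization"],
]
edges = cooccurrence_edges(patents, min_frequency=2)
```

The resulting pair counts can serve as edge weights of the keyword network, or be replaced by the cosine measure when term frequencies per patent are available.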
3.2.3 Patent portfolio analysis with SNA indicators
The external-internal index (E-I index), which compares the number of ties between
subgroups (external ties, E) with the number of ties within subgroups (internal ties, I)
as (E – I)/(E + I), is used to measure the internal or
external relationships between different technology fields. The possible values of
the E-I index range from –1 to 1. A value close to 1 indicates predominantly external relationships,
while a value close to –1 indicates predominantly internal relationships. In addition, betweenness centrality is used to identify important technologies in the keyword co-occurrence
network.
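As an illustration, the E-I index can be computed directly from an edge list and a mapping of nodes to subgroups. The sketch assumes the standard Krackhardt and Stern formulation, (E - I)/(E + I), where E counts ties between subgroups and I counts ties within subgroups; the function name and the toy network are hypothetical.

```python
def ei_index(edges, group_of):
    """E-I index: (external ties - internal ties) / (all ties)."""
    external = sum(1 for a, b in edges if group_of[a] != group_of[b])
    internal = len(edges) - external
    total = external + internal
    if total == 0:
        raise ValueError("E-I index is undefined for a network without ties")
    return (external - internal) / total


# Toy network: two subgroups with one tie between them.
group_of = {"A": "g1", "B": "g1", "C": "g2", "D": "g2"}
edges = [("A", "B"), ("C", "D"), ("A", "C")]
value = ei_index(edges, group_of)  # 1 external tie, 2 internal ties
```

A negative value, as in this toy case, means that internal ties outnumber external ones.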
4 Empirical study
4.1 Data collection
We selected the Derwent Innovations Index (DII) database as the data source. DII
merges the value-added patent information from Derwent World Patents Index with
the patent citation information from Derwent Patent Citation Index. As the largest
patent database in the world, DII covers over 14.3 million basic inventions from
40 worldwide patent-issuing authorities, including chemistry, electrics, electronics
and mechanical engineering. Data in DII can be traced back to 1963, with over
30,000 patent records added weekly. In DII, patent information can be retrieved
according to patent code, assignee, assignee code and inventor and patents are
assigned IPC codes, Derwent class codes, and Derwent manual codes.
According to the World Intellectual Property Organization (WIPO)’s IPC8-
Technology Concordance Table revised in January, 2013, technologies are
classified into 5 areas (electrical engineering, instruments,
chemistry, mechanical engineering and other fields), which are further divided into 35 fields.
We retrieved data on November 20, 2013 using the IPC codes③ of the chemistry
patents filed by a Chinese national research institute and its branch institutes between
1985 and 2012. For this national research institute, chemistry is the area in which
it has filed the most patent applications. We retrieved a total of 27,401 chemistry records.
As sufficient data are necessary for building patent portfolios, we selected
chemistry patents as our data sample.
③ IPC codes are listed in Appendix I.
4.2.1 Network based patent analysis
Figure 2 displays the co-classification network of IPC codes. Based on Girvan-
Newman algorithm, we found 27,401 patents in chemistry are related to 11 fields.
The EI index value is –0.094, which means the 11 fields have maintained relatively
Figure 2 illustrates the internal and external relationships of the subgroups. The
purple line indicates the internal relations and the green line the external relations.
More internal relationships are observed from Fig. 2, especially in fields of chemical
engineering, pharmaceuticals and organic fine chemistry. But the fields of food chemistry, micro-structural and nano-technology, and surface technology and
coating are found to be weakly connected.
Fig. 2. Patent co-classification network based on IPC codes.
There are 188 subclasses in the Derwent manual code classification, with a clear
hierarchy and long-term stability. In our case study, this large research institute’s
patents in chemistry cover 182 subclasses. The clustering and visualization of the
patents based on the Girvan-Newman algorithm is shown in Fig. 3. The patents can be
classified into 6 technological fields: 1) polymer/plastics, 2) general chemistry, 3)
catalyst, 4) pharmaceuticals, 5) agriculture, 6) semiconductors, circuits, fireproof
materials, ceramics, cement and electrochemistry. The threshold value of the network
is set at 0.05 to obtain a satisfactory visualization result.
Fig. 3. Patent co-classification network based on Derwent manual codes.
Polymer/plastics is located in the center of the network. The patents in this
technology field are mainly about monomer, concentration, polymerization, natural
polymer, addition polymer, condensation polymer, inorganic polymer, polymer
blending, aqueous dispersion, additive, property, analyzing, testing, controlling,
polymer modification and polymer processing. Polymer processing
covers devices, materials and preparation methods for polymer application and
plastics. Polymer application, which has the highest betweenness centrality, is
identified as the core node of the whole network.
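To make the notion of a core node concrete, the following sketch computes betweenness centrality with Brandes' algorithm for a small unweighted, undirected network. It is not the software used in the study; the function names and the toy star-shaped network are assumptions for illustration only, with node labels borrowed from the text.

```python
from collections import deque


def build_adjacency(edges):
    """Symmetric adjacency sets from an undirected edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def betweenness_centrality(adjacency):
    """Unnormalized betweenness (Brandes) for an unweighted, undirected graph."""
    centrality = {v: 0.0 for v in adjacency}
    for source in adjacency:
        stack = []
        predecessors = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}  # number of shortest paths from source
        sigma[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        delta = {v: 0.0 for v in adjacency}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected shortest path was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}


# Toy star network: every shortest path between leaves passes through the hub.
edges = [
    ("polymer application", "polymerization"),
    ("polymer application", "additive"),
    ("polymer application", "polymer processing"),
]
scores = betweenness_centrality(build_adjacency(edges))
core_node = max(scores, key=scores.get)
```

In this toy network the hub lies on all three leaf-to-leaf shortest paths, which is the sense in which a node with the highest betweenness acts as a bridge.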
4.2.2 Keyword-based network construction
Using the method in Section 3.2.2, we constructed the patent keyword network.
Keywords with high betweenness centrality, such as polymer application, polymerization
reaction and fermentation industry, play a bridging role in connecting other patents in
the network. These important technologies usually occupy core positions in the
network and can be considered key technologies. The keyword network of patents
in polymer application is illustrated in Fig. 4.
Fig. 4. The possible patent portfolios of polymer applications.
4.2.3 Portfolios possibility analysis
In view of core nodes in the network, we analyzed the possibilities of building
patent portfolios on the 409 patents in polymer application. Figure 4 shows the
co-occurrence network of keywords with frequencies of occurrence more than
100 times in patent documents of polymer application. Our findings indicate that
this large research institute’s polymer application patents can be roughly summarized
into six subject areas: 1) biological polymers, 2) synthetic resin, 3) conductive
polymers, 4) engineering plastics, 5) fertilizers and 6) polyamide system.
We observed close relationships between the six subject areas, so a comprehensive
protection solution is recommended. First, when the same technology is
applied to different fields, these different applications may be considered as a patent
portfolio. Second, we suggest patent portfolio protection for diversified technologies
and their applications within one of the six subject areas. For instance, in the synthetic
resin area, the preparation of phenolic resin contains various solvents (acetone,
ethyl acetate, component solvent, etc.) and strengthened materials (polycarbafil,
activated charcoal, etc.). As a result, a variety of preparation and materials of
phenolic resin can be considered as a patent portfolio, which also highlights the
value of individual patents.
4.2.4 Potential cooperation analysis
We selected the top 10 assignees of polymer patent applications of this national
institute and the keywords of these applications to construct a two-mode co-occurrence
network, as illustrated in Fig. 5. We detected multiple technology cooperation areas
for these patent assignees.
Fig. 5. Potential partners in polymer application.
In Fig. 5, the red square nodes represent the top-ranked patent assignees. The blue
circle nodes represent the key patent technologies. The nodes’ size is proportional
to the value of the betweenness centrality of the nodes. The lines between these
nodes indicate the relations between the assignees and the patented technologies.
The thicker the line, the stronger is the linkage.
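One way to read potential partnerships off such a two-mode network is to project it onto assignee pairs that share keywords. The sketch below is a simplified illustration: the function name is hypothetical and the assignee-keyword data are toy values, not the data behind Fig. 5.

```python
from collections import defaultdict
from itertools import combinations


def shared_technologies(assignee_keywords):
    """Map each pair of assignees to the set of keywords they have in common."""
    by_keyword = defaultdict(set)
    for assignee, keywords in assignee_keywords.items():
        for keyword in keywords:
            by_keyword[keyword].add(assignee)
    pairs = defaultdict(set)
    for keyword, assignees in by_keyword.items():
        # Sorting gives each pair a single canonical order.
        for a, b in combinations(sorted(assignees), 2):
            pairs[(a, b)].add(keyword)
    return dict(pairs)


# Toy two-mode data (assignee -> keywords), for illustration only.
assignee_keywords = {
    "A": {"phenolic resin", "polyethylene glycol"},
    "B": {"phenolic resin", "chitosan"},
    "D": {"chitosan"},
}
partners = shared_technologies(assignee_keywords)
```

Pairs with many shared keywords are candidates for closer inspection by technological experts; the projection alone does not establish that cooperation would be worthwhile.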
As revealed in Fig. 5, A, B and C possess multiple important patents and
technologies in polymer application. As a result, there are potential cooperation opportunities for these institutes. We discuss some of the most promising cooperation
opportunities in detail:
• A, B, D and F all focus their research on synthesis and application of phenolic
resin and other resins;
• A, B, C, F and G have filed many patents on the preparation and application
of polyethylene glycol;
• B, D and E have patent filings in preparation and application of chitosan;
• A and E both have applied patents in aluminium metal materials and
• B and C are both involved in the preparation and application of coating
• A, C and F all have applied patents in polyvinyl alcohol preparation and
5 Discussions and conclusions
Previous studies on patent portfolios were based on the use of a combination
of patent analysis indicators and focused primarily on integrating multidimensional
indicator systems for exploring the patent assets of a single assignee.
This paper suggests a framework that provides insights into building potential patent
portfolios at the technology level, focusing especially on patents owned
by different assignees and on identifying potential collaborators. More specifically, the
integrated approach based on topic identification and correlation analysis that
combines network-based analysis with keyword-based analysis can reveal the
overall situation of a relevant technological field and the content of the patent
technologies at the same time.
This research needs to be improved in several aspects. For instance, in order to
reveal the value of individual patents for technology transfer, we need to perform
analysis at a more detailed level of granularity. To this end, we may consider the
weight of words extracted from patent titles and abstracts or extract keywords from
the patent summary section in addition to titles and abstracts for keyword analysis. In
addition, more case studies need to be conducted in different technological
fields in order to verify the effectiveness of the suggested framework. We will
address these issues in our future research.
Appendix I. WIPO technology concordance in chemistry
References
Brockhoff, K.K. Indicators of firm patent activities. In Technology Management: The New International Language. Portland International Conference on Management of Engineering and Technology. Washington, DC: IEEE Computer Society, 1991: 476-481.
Yue, X.P. Inner-firm patent portfolio scale strategy: Based on cournot model. Journal of Intelligence (in Chinese), 2012, 31(11): 118-122, 135. Retrieved on May 14, 2015 from http://d.wanfangdata.com.cn/Periodical_qbzz201211025.aspx
Spasser, M.A. Mapping the terrain of pharmacy: Co-classification analysis of the international pharmaceutical abstracts database. Scientometrics, 1997, 39(1): 77-97. Retrieved on May 14,2015, from http://link.springer.com/article/10.1007/BF02457431
Jaffe, A. Technological opportunity and spillovers of R&D: Evidence from firms' patents, profits, and market value. American Economic Review, 1986, 76(5): 984-1001.
Kauffman, S., Lobo, J., & Macready, W.G. Optimal search on a technology landscape. Journal of Economic Behavior & Organization, 2000, 43(2): 141-166. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0167268100001141
Lee, S., Yoon, B., & Park, Y. An approach to discovering new technology opportunities: Keyword-based patent map approach, Technovation, 2009, 29 (6/7): 481-497. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0166497208001326
Trajtenberg, M. A penny for your quotes: Patent citations and the value of innovations. RAND Journal of Economics, 1990, 21(1): 172-187. Retrieved on May 14, 2015, from http://www.jstor.org/stable/2555502?seq=1#page_scan_tab_contents
Narin, F. Patents as indicators for the evaluation of industrial research output. Scientometrics,1995, 34 (3): 489-496. Retrieved on May 14, 2015, from http://link.springer.com/article/10.1007%2FBF02018015
Lanjouw, J.O., & Schankerman, M.A. Characteristics of patent litigation: A window on competition. RAND Journal of Economics, 2000, 32 (1): 129-151. Retrieved on May 14,2015, from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=252299
Harhoff, D., Scherer, F.M., & Vopel, K. Citations, family size, opposition and the value of patent rights. Research Policy, 2003, 32(8): 1343-1363. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0048733302001245
Harhoff, D., & Reitzig, M. Determinants of opposition against EPO patent grants - The case of biotechnology and pharmaceuticals. International Journal of Industrial Organization, 2004,22: 443-480.
Haupt, R., Kloyer, M., & Lange, M. Patent indicators for the technology life cycle development. Research Policy, 2007, 36(3): 387-398. Retrieved on May 14, 2015,from http://www.sciencedirect.com/science/article/pii/S0048733307000054
Wang, X., Zhang, X., & Xu, S. Patent co-citation networks of Fortune 500 companies. Scientometrics, 2011, 88(3): 761-770. Retrieved on May 14, 2015, from http://link.springer.com/article/10.1007%2Fs11192-011-0414-x
Kraslawski, A. Semantic analysis for identification of portfolio of R&D projects. Example of microencapsulation. The 16th European Symposium on Computer Aided Process Engineering and 9th International Symposium on Process Systems Engineering, 2006, 21: 1905-1910. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S1570794606803263
Yoon, B., & Park, Y. Development of new technology forecasting algorithm: Hybrid approach for morphology analysis and conjoint analysis of patent information. IEEE Transactions on Engineering Management, 2007, 54 (3): 588-599. Retrieved on May 14, 2015, from http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=4278022
Tseng, Y.H., Lin, C.J., & Lin, Y.I. Text mining techniques for patent analysis. Information Processing & Management, 2007, 43: 1216-1247. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0306457306002020
Yoon, B., & Park, Y. A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 2004, 15(1):37- 50. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S1047831003000439
Luan, C.J. Mapping the evolution of technology network and identifying key technologies in the field of solar energy technology via co-occurrence analysis. Journal of the China Society for Scientific and Technical Information (in Chinese), 2013, 32(1): 68-79. Retrieved on May 14, 2015, from http://d.wanfangdata.com.cn/Periodical_qbxb201301009.aspx
Krackhardt, D., & Stern R.N. Informal networks and organizational crises: An experimental simulation. Social Psychology Quarterly, 1988, 51(2): 123-140. Retrieved on May 14, 2015, from http://www.jstor.org/stable/2786835?origin=crossref&seq=1#page_scan_tab_contents
Liu, J. An introduction to social network analysis (in Chinese). Beijing: Social Sciences Academic Press, 2004.
Newman, M.E.J., & Girvan, M. Finding and evaluating community structure in networks. Physical Review E, 2004, 69: 026113. Retrieved on May 14, 2015, from http://journals.aps.org/pre/abstract/10.1103/PhysRevE.69.026113
Wang, B., Liu, S., & Ding, K., et al. Identifying technological topics and institution-topic distribution probability for patent competitive intelligence analysis: A case study in LTE technology. Scientometrics, 2014, 101(1): 685-704. Retrieved on May 14, 2015, from http://link.springer.com/article/10.1007%2Fs11192-014-1342-3
Thomson Reuters. DWPI classification system. Retrieved on May 14, 2015, from http://ip-science.thomsonreuters.com/support/patents/dwpiref/reftools/classification/